Introduction To Geographic Data and Quiz 2 Information

Author

Penelope Pooler Eisenbies

Published

March 23, 2026

Housekeeping

Upcoming Dates

  • HW 5 - Part 1 is due tomorrow, 3/25

  • Rough Draft Proposals are due on Thursday, 3/26 at 6:00 PM

  • Project presentations will take place on April 23rd in class (4 weeks).

  • HW 5 - Part 2 will be posted after Quiz 2.

  • Quiz 2 is on Tuesday 4/7

    • Mostly on Weeks 5 through 9, but material is cumulative.

    • Mostly on HW Assignments 4 and 5, but material is cumulative.

    • Data Mgmt. tasks will require 2 or 3 steps and then you will answer questions.

Next Few Lectures

  • Practice Questions and Demo Videos for Quiz 2 are available.

  • Next Tuesday: Skills/Concepts Review for Quiz 2

  • Putting skills together for different goals

  • Come with questions or submit them in Engagement Questions

  • Intro to Managing and Plotting Geographic Data

    • Today’s geographic maps will not be on Quiz 2.

    • Mapping data is an effective data curation tool that may be useful in this class, other classes, and your career.

  • We will cover more about geographic data and map visualizations in upcoming lectures.

  • If you want help with mapping project data, please reach out to me.

In-class Exercise - Week 10

Purpose:

  • To gain some experience and understanding of map data available in R and elsewhere.

  • To experiment with mapping data

  • Students are encouraged to use domestic or international map data in their dashboards if appropriate.

Data Preparation

  • The data for today’s in-class exercise are available through R packages.

  • These geographic data are useful if you have information by state or county and you want to show a choropleth map of your data.

  • R also has world information, e.g., countries, continents, etc.
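As a quick look at what these built-in map data contain, the sketch below pulls the state and world polygons and lists their columns. It assumes the ggplot2 and maps packages are installed; the object names `states_demo` and `world_demo` are illustrative and not part of the lecture code.

```r
library(ggplot2)   # provides map_data(), which reads polygons from the maps package

states_demo <- map_data("state")   # polygons for the contiguous US states
world_demo  <- map_data("world")   # polygons for countries of the world

# each row is one polygon vertex: long/lat give its position,
# group identifies the polygon, order gives the drawing sequence
names(states_demo)
head(unique(states_demo$region))   # state names are stored as lowercase text
head(unique(world_demo$region))    # here region holds country names
```

Because the state names are lowercase text, any dataset joined to these polygons needs its state (and county) names cleaned to the same form first.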

```{r us data prep}
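# packages this code relies on (assumed; the setup chunk is not shown on this page)
library(tidyverse)   # dplyr verbs and ggplot2 (map_data, ggplot)
library(maps)        # polygon data read by map_data()
library(mapproj)     # projections used by coord_map()
library(ggthemes)    # theme_map()
library(gridExtra)   # grid.arrange()
library(usdata)      # county_2019 (assumed source of this dataset)

us_states <- map_data("state") |>               # state polygons (not used today)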
  rename("state" = "region")
us_counties <- map_data("county") |>            # county polygons
  rename("state" = "region", "county" = "subregion") |>
  mutate(county = gsub(" ", "", county), 
         county = gsub("'","", county) |> tolower())
#unique(us_counties$county[us_counties$state=="louisiana"]) # note issue Louisiana counties
cnty2019_all <- county_2019
#unique(cnty2019_all$name[cnty2019_all$state=="Louisiana"]) # note issue Louisiana counties

cnty2019_all <- cnty2019_all |> 
  mutate(state = tolower(state),
         county = tolower(name),
         county = gsub(" county", "", county),
         county = gsub(" parish", "", county),
         county = gsub("\\.", "", county),          # \\ escapes the period; an unescaped . matches any character in a regex
         county = gsub(" ", "", county),
         county = gsub("'","", county)) |>
  relocate(county, .before=name)

cnty2019_all <- full_join(us_counties,cnty2019_all)  # geo data and demographic data
```
Joining with `by = join_by(state, county)`
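To see what the cleaning steps in the chunk above do to individual names, here is a small sketch in base R. The helper `clean_county()` and the example names and keys are hypothetical and only repeat the same `gsub` calls on made-up inputs.

```r
# hypothetical helper that repeats the cleaning steps from the chunk above
clean_county <- function(x) {
  x <- tolower(x)
  x <- gsub(" county", "", x)
  x <- gsub(" parish", "", x)
  x <- gsub("\\.", "", x)   # escaped period: removes literal periods only
  x <- gsub(" ", "", x)
  gsub("'", "", x)
}

demo_names <- c("Onondaga County", "St. Mary Parish", "Prince George's County")
clean_county(demo_names)

# without the escape, "." matches every character and the whole name is erased
gsub(".", "", "st. mary")
# fixed = TRUE is another way to treat the period literally
gsub(".", "", "st. mary", fixed = TRUE)

# keys that appear in only one dataset will not match in a join
map_keys  <- c("onondaga", "stmary")   # made-up map-side keys
demo_keys <- clean_county(demo_names)
setdiff(demo_keys, map_keys)           # cleaned names with no polygon match
```

Running `setdiff` in both directions on the real keys is one way to spot names (such as the Louisiana parishes) that still need cleaning before the join.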

County Demographic Plots

  • Creating a new dataset is not required, but helpful

  • Plot code could be converted into a function
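One possible way to turn the repeated plot code into a function is sketched below. The name `county_map()` and its arguments are hypothetical, not part of the lecture code, and the toy data frame only stands in for the county data. The sketch assumes ggplot2 and ggthemes are installed; drawing the plot also needs mapproj for the Albers projection.

```r
library(ggplot2)
library(ggthemes)   # theme_map()

# hypothetical helper: builds one county map for the column named in fill_var
county_map <- function(data, fill_var, title) {
  ggplot(data, aes(x = long, y = lat, group = group,
                   fill = .data[[fill_var]])) +   # look up the column by its name
    geom_polygon() +
    theme_map() +
    coord_map("albers", lat0 = 39, lat1 = 45) +
    labs(fill = "", title = title) +
    scale_fill_continuous(type = "viridis")
}

# toy polygon (one square) standing in for the county data
toy <- data.frame(long = c(-100, -99, -99, -100),
                  lat = c(40, 40, 41, 41),
                  group = 1,
                  hs_grad = 88)
p_demo <- county_map(toy, "hs_grad", "Percent with High School Degree")
```

With the lecture data, a call such as `county_map(cnty_data1, "bachelors", "Percent with Bachelor's Degree")` would replace one of the copied chunks.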

```{r}
#|label: select data and plot 1 map
cnty_data1 <- cnty2019_all |> 
  select(long:county, hs_grad, bachelors, 
         household_has_computer, 
         household_has_broadband) 
cnty_hs_grad <- cnty_data1 |>
  ggplot(aes(x=long, y=lat, 
             group=group, fill=hs_grad)) +
    geom_polygon() +
    theme_map() +
    coord_map("albers", lat0 = 39, lat1 = 45) +
    labs(fill= "", 
         title="Percent with High School Degree") +
    scale_fill_continuous(type = "viridis") 
```

```{r}
#|label: plot 2 code
cnty_bachelors <- cnty_data1 |>
  ggplot(aes(x=long, y=lat, 
             group=group, 
             fill=bachelors)) +
    geom_polygon() +
    theme_map() +
    coord_map("albers", lat0 = 39, lat1 = 45) +
    labs(fill= "", title="Percent with Bachelor's Degree") +
    scale_fill_continuous(type = "viridis")
```
Warning: The `size` argument of `element_rect()` is deprecated as of ggplot2 3.4.0.
ℹ Please use the `linewidth` argument instead.

```{r}
#|label:  plot 3 code
cnty_computer <- cnty_data1 |>
  ggplot(aes(x=long, y=lat, 
             group=group, 
             fill=household_has_computer)) +
    geom_polygon() +
    theme_map() +
    coord_map("albers", lat0 = 39, lat1 = 45) +
    labs(fill= "", title="Percent of Households with a Computer") +
    scale_fill_continuous(type = "viridis")
```

```{r plot 4 code}
cnty_brdbd <- cnty_data1 |>
  ggplot(aes(x=long, y=lat, 
             group=group, 
             fill=household_has_broadband)) +
    geom_polygon() +
    theme_map() +
    coord_map("albers", lat0 = 39, lat1 = 45) +
    labs(fill= "", title="Percent of Households with Broadband") +
    scale_fill_continuous(type = "viridis")
```

Demographic Plot Grid - Order Matters

Grid is populated left to right and then top to bottom unless otherwise specified.
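The fill order described above can be checked with placeholder plots. This sketch assumes ggplot2 and gridExtra are installed; the helper `make_demo()` and the four toy plots are illustrative only. It uses `arrangeGrob()`, which builds the same layout as `grid.arrange()` without drawing it, and the `layout_matrix` argument as one way to specify a different order.

```r
library(ggplot2)
library(gridExtra)

# four labelled placeholder plots (toy data, not the county maps)
make_demo <- function(lbl) {
  ggplot(data.frame(x = 1, y = 1), aes(x, y)) +
    geom_point() +
    labs(title = lbl)
}
p1 <- make_demo("first");  p2 <- make_demo("second")
p3 <- make_demo("third");  p4 <- make_demo("fourth")

# default order: row 1 = first, second; row 2 = third, fourth
grid_by_row <- arrangeGrob(p1, p2, p3, p4, ncol = 2)

# layout_matrix overrides the default; matrix() fills by column,
# so column 1 = first, second and column 2 = third, fourth
grid_by_col <- arrangeGrob(p1, p2, p3, p4,
                           layout_matrix = matrix(1:4, nrow = 2))

grid::grid.draw(grid_by_row)   # draw one of the grids
```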

```{r eval=F}
#|label: grid of pct plots
grid.arrange(cnty_hs_grad, cnty_computer, cnty_bachelors, cnty_brdbd, ncol=2)
```

In-class Exercise - Plot Grid (Steps 1 & 2)

Steps to Follow

  1. Examine the available variables in the cnty2019_all dataset saved from R.

  2. Create individual plots.

  • Variable names in R dataset and definitions:

    • household_has_smartphone: Households with Smart Phones
    • median_age: Median Age
    • median_household_income: Median Household Income
    • median_individual_income: Median Individual Income

You can copy provided plot code and modify it for these variables or you can try to create a function (not required today).

```{r}
#|label: usmaps demographic maps exercise
# select data for plot (not required, but helpful)
# cnty_data2 <- 

# create four individual plots, one for each variable
# use provided plot code and modify
```

In-class Exercise - Plot Grid (Steps 3 & 4)

  3. Create a 2x2 plot grid of these four variables by US county as part of your class participation credit for this week.

  • For full credit, plots must be in the order specified.

    • Row 1: smartphone variable and median age.
    • Row 2: median household income and median individual income.

  • Create the plot grid using grid.arrange.

```{r}
#|label: in-class exercise 2x2 plot grid
# for full credit grid of plots must be in order specified
# border around all four plots not required
```
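  4. Right-click on the plot grid, choose Save as…, and save it to the img folder with the correct name.
  • Note: ggsave saves the most recent ggplot by default, so it will not capture a grid drawn with grid.arrange unless the grid object is passed to ggsave explicitly.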
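The note above concerns calling ggsave with its default behavior. As an optional alternative to right-clicking (not required for the exercise), the sketch below builds the grid as an object with `arrangeGrob()` and passes it to ggsave through the `plot` argument. The toy plots and the file name are made up for illustration.

```r
library(ggplot2)
library(gridExtra)

# two toy plots standing in for the exercise maps
p_a <- ggplot(data.frame(x = 1:3, y = 1:3), aes(x, y)) + geom_point()
p_b <- ggplot(data.frame(x = 1:3, y = 3:1), aes(x, y)) + geom_line()

# arrangeGrob() builds the grid as an object without drawing it
demo_grid <- arrangeGrob(p_a, p_b, ncol = 2)

# pass the object through the plot argument; the file name is illustrative
out_file <- file.path(tempdir(), "demo_grid.png")
ggsave(out_file, plot = demo_grid, width = 8, height = 4)
```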

Mapping Log Transformed Data

  • The previous examples are all percent data.

  • No transformations are needed

  • In contrast, population data or financial data are often right-skewed and need to be log transformed.

  • Recall from MAS 261, BUA 345 (and perhaps FIN classes):

    • An effective transformation for right skewed data is the natural log (LN) transformation.

    • The following demo shows how useful it is for mapping right skewed data.
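As a small numeric reminder of why the natural log helps, the sketch below uses made-up values that span several orders of magnitude, much as county populations do. The vector `pop_demo` is illustrative only.

```r
# made-up populations (in thousands) spanning several orders of magnitude
pop_demo <- c(1, 10, 100, 1000, 10000)

mean(pop_demo)     # pulled far above the median by the largest value
median(pop_demo)

# log() in R is the natural log (LN)
ln_pop <- log(pop_demo)
diff(ln_pop)       # equal steps: each tenfold increase adds log(10)

mean(ln_pop)       # after the transformation the mean and median agree
median(ln_pop)
```

On the original scale one large value dominates; on the log scale the values are evenly spread, which is what lets a color scale show variation among smaller counties.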

Plot of Skewed Data

```{r}
#|label: untransformed pop. map code

cnty_data3 <- cnty2019_all |> 
  select(long:county, pop) |>
  mutate(pop1k = pop/1000)
       
cnty_pop <- cnty_data3 |>
    ggplot(aes(x=long, y=lat, group=group, fill=pop1k)) +
    geom_polygon() +
    theme_map() +
    coord_map("albers", lat0 = 39, lat1 = 45) +
    labs(fill= "", title="Population by County",
         subtitle="Unit is 1000 People") +
    scale_fill_continuous(type = "viridis") +
    theme(legend.position = "bottom",
          legend.key.width = unit(1, "cm"))
```

Histogram Clarifies Data Skewness

```{r}
#|label: untransformed pop hist code
hist_pop <- cnty_data3 |>
  ggplot() +
  geom_histogram(aes(x=pop1k),
                 fill="lightblue", 
                 col="darkblue") +
  labs(x="Population", title="Histogram of US Population Data",
       y = "Count",subtitle="Unit is 1000 People") +
  theme_classic()
```

The Problem: We see we have skewed data BUT presenting log transformed data in a map may complicate data interpretation.

`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
Warning: Removed 212 rows containing non-finite outside the scale range
(`stat_bin()`).

Solution - Natural Log Transformed Plot

  • Data are not transformed, but the axes and color scale are

  • Options specified in scale_fill_continuous:

    • trans = "log"
    • breaks = c(...)
  • break intervals determined by examining data
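One way to choose break values by examining the data is sketched below: look at the range, then take the powers of 10 that span it. The vector `pop1k_demo` is made up to stand in for `pop1k`; this is an assumed approach, not necessarily how the breaks in the lecture code were chosen.

```r
# made-up values standing in for pop1k (population in thousands)
pop1k_demo <- c(1.2, 8, 45, 310, 2700, 9800, NA)

range(pop1k_demo, na.rm = TRUE)   # smallest and largest values

# powers of 10 that span the data; these are evenly spaced on a log scale
lo <- floor(log10(min(pop1k_demo, na.rm = TRUE)))
hi <- ceiling(log10(max(pop1k_demo, na.rm = TRUE)))
demo_breaks <- 10^(lo:hi)
demo_breaks
```

The resulting vector can then be supplied as the `breaks` option alongside `trans = "log"`.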

```{r}
#|label: log transformed map plot code
cnty_lpop <- cnty_data3 |>
  ggplot(aes(x=long, y=lat, group=group, fill=pop1k)) +
    geom_polygon() +
    theme_map() +
    coord_map("albers", lat0 = 39, lat1 = 45) +
    labs(fill= "", title="Population by County",
         subtitle="Unit is 1000 People and Data are Log-transformed") +
    scale_fill_continuous(type = "viridis",trans="log",
                          breaks=c(1,10,100,1000,10000)) +
    theme(legend.position = "bottom",
          legend.key.width = unit(1, "cm"))
```

Histogram of Log transformed Data

  • The final dashboard doesn’t include exploratory plots
  • Data exploration plots:
    • Histograms, scatterplots, and boxplots are all useful

```{r}
#|label: log transformed pop hist code
hist_lpop <- cnty_data3 |>
  ggplot() +
  geom_histogram(aes(x=pop1k),fill="lightblue", col="darkblue") +
  labs(x="Population", 
       title="Histogram of Natural Log of US Population Data",
       y = "Count",
       subtitle="Unit is 1000 People and Data are Log-transformed") +
  theme_classic() + 
  scale_x_continuous(trans="log", breaks=c(1,10,100,1000,10000))
```

Histogram of Log-transformed Data

`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
Warning: Removed 212 rows containing non-finite outside the scale range
(`stat_bin()`).

Population Plot Grid

```{r eval=F}
#|label: plot code for pop grid
grid.arrange(cnty_pop, hist_pop,cnty_lpop, hist_lpop, ncol=2) 
```
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
Warning: Removed 212 rows containing non-finite outside the scale range
(`stat_bin()`).

When and How to Log Transform

  • Log transformations are useful if you have right-skewed POSITIVE data such as

    • Prices
    • Population
    • Sales
    • Income
  • Note: If data (x) have zeros, a good option is to use log(x + 1)

    • ln(1) = 0 (In R log(1) = 0)

    • 0 values in the data will still be zeros

  • In the following example we will create plots for number of households by county:

    • Histograms with and without LN transformation

    • Map plots with and without LN transformation
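The note above about zeros can be illustrated with a few made-up counts. The vector `x_demo` is hypothetical; `log1p()` and `expm1()` are base R functions.

```r
# made-up counts that include zeros
x_demo <- c(0, 0, 3, 25, 400)

log(x_demo)        # log(0) is -Inf, which cannot be plotted
log(x_demo + 1)    # zeros stay at 0 because log(1) = 0
log1p(x_demo)      # base R shortcut for log(x + 1)

expm1(log1p(x_demo))   # expm1() reverses log1p() to recover the original values
```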

Number of Households Per County - Without Transformation

Untransformed Data Histogram

```{r}
#|label: data and untransformed hh data histogram code
#|
cnty_data4 <- cnty2019_all |> 
  select(long:county, households) |>
  mutate(households1K = households/1000)

hist_hholds <- cnty_data4 |>
  ggplot() +
  geom_histogram(aes(x=households1K),
                 fill="lightblue", 
                 col="darkblue") +
  labs(x="Number of Households", title="Histogram of US Household Data",
       y = "Count",subtitle="Unit is 1000 Households") +
  theme_classic()
```

Number of Households is highly right-skewed.

`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
Warning: Removed 212 rows containing non-finite outside the scale range
(`stat_bin()`).

Number of Households Per County - With Transformation

Log transformed Data Histogram

```{r}
#|label: log transformed hh data histogram code
hist_lhholds <- cnty_data4 |>
  ggplot() +
  geom_histogram(aes(x=households1K),
                 fill="lightblue", 
                 col="darkblue") +
  labs(x="Number of Households", 
       title="Histogram of Natural Log of US Household Data",
       y = "Count",
       subtitle="Unit is 1000 Households and Data are Log-transformed") +
  theme_classic() + 
  scale_x_continuous(trans="log", breaks=c(1,10,100,1000))
```

Log-transformed data appear approximately normally distributed.

`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
Warning: Removed 212 rows containing non-finite outside the scale range
(`stat_bin()`).

Number of Households Per County

```{r untransformed hh plot map code}
cnty_hholds <- cnty_data4 |>
  ggplot(aes(x=long, y=lat, group=group, fill=households1K)) +
  geom_polygon() +
  theme_map() +
  coord_map("albers", lat0 = 39, lat1 = 45) +
  labs(fill= "", title="Number of Households by County",
       subtitle="Unit is 1000 Households") +
  scale_fill_continuous(type = "viridis") +
  theme(legend.position = "bottom",
        legend.key.width = unit(1, "cm"))
```

Map of untransformed households per county data is uninformative.

In-class Exercise 2

Log-transformed data map is more informative about geographic variability.

Submit R code to create log transformed households map in a text file with your name.

Households per County Plot Grid

```{r household grid plot code, eval=F}
grid.arrange(hist_hholds, hist_lhholds, cnty_hholds, cnty_lhholds, ncol=2)
```
`stat_bin()` using `bins = 30`. Pick better value `binwidth`.
Warning: Removed 212 rows containing non-finite outside the scale range
(`stat_bin()`).

Quiz 2 Information

  • Questions and Material from Quiz 1 may be on Quiz 2

  • Practice Questions and demo videos are posted.

    • Review Quiz 1 and Quiz 1 Practice Questions

    • Review HW 4, HW 5 - Part 1, and recent lectures

  • Converting text (character) date information to a date using lubridate commands (Week 5)

    • Example R commands: ymd, dmy, mdy, ym, combined with paste to combine columns
  • Extracting year, month, or day from the date variable using lubridate commands (Week 6)

    • Example R commands: year, month, quarter, wday, day
  • Converting an xts dataset to a tibble and vice versa

    • Creating a line plot from a time series (non-xts) dataset

    • Creating an interactive hchart or dygraph from an xts dataset
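As a brief refresher on the lubridate commands listed above, here is a sketch using made-up date text. The object names are illustrative, and the example assumes the lubridate package is installed.

```r
library(lubridate)

# text dates in different orders; pick the function that matches the order
d1 <- mdy("3/23/2026")
d2 <- dmy("23-03-2026")
d3 <- ymd(paste(2026, 3, 23, sep = "-"))   # paste combines separate columns first
m1 <- ym("2026-03")                        # year and month only; day is set to the 1st

# extract pieces of a date
year(d1)
month(d1)
quarter(d1)
day(d1)
wday(d1, label = TRUE)   # day of the week as a labelled factor
```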

Quiz 2 Info Continued

  • You should be familiar with the bls_tidy function we created and how to use it to import similar datasets.

  • There will be datasets to be imported AND combined (joined)

    • Data sets can be joined by matching rows on shared key columns.

    • You should know how to do the different joins we covered and what each one does:

      • full_join

      • right_join

      • left_join

      • inner_join

  • Data sets can be stacked (rows appended) if the columns are identical

    • In BUA 455 we covered bind_rows
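The differences between the joins listed above, and between joining and stacking, can be seen with two toy tables. The tables and their column names are made up for illustration; the example assumes dplyr is installed.

```r
library(dplyr)

# toy tables sharing the key column id
x_demo <- tibble(id = c("a", "b", "c"), x_val = 1:3)
y_demo <- tibble(id = c("b", "c", "d"), y_val = 4:6)

inner_join(x_demo, y_demo, by = "id")   # only ids found in both: b, c
left_join(x_demo, y_demo, by = "id")    # all ids in x_demo: a, b, c
right_join(x_demo, y_demo, by = "id")   # all ids in y_demo: b, c, d
full_join(x_demo, y_demo, by = "id")    # every id: a, b, c, d (NA where no match)

# stacking: same columns, rows appended
more_x <- tibble(id = c("e", "f"), x_val = 7:8)
bind_rows(x_demo, more_x)               # 5 rows, 2 columns
```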

Quiz 2 Info Continued

  • Cleaning messy data (Week 5 and Weeks 8-9)

    • Dealing with text (character variables)

      • gsub

      • separate

      • unite or paste or paste0

      • ifelse can be used for text OR for numeric data

      • ifelse followed by factor allows you to make any categorical variable you want.

  • Other commands for modifying text:

    • tolower and toupper

    • str_trim, str_squish and str_pad

    • ifelse sometimes followed by factor
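Several of the text commands listed above are shown together in the sketch below on made-up strings. It assumes the stringr package is installed and does not cover every command in the list (separate and unite are omitted).

```r
library(stringr)

raw <- c("  New   York ", "CALIFORNIA", "texas  ")

str_trim(raw)                       # drops leading and trailing spaces only
clean <- tolower(str_squish(raw))   # str_squish also collapses repeated inner spaces
clean

gsub(" ", "_", clean)               # replace every space with an underscore
toupper(clean)
padded <- str_pad(c("7", "42"), width = 3, pad = "0")   # pads on the left by default
padded

# ifelse() builds a text category; factor() then fixes the level order
size <- ifelse(nchar(clean) > 8, "long", "short")
size <- factor(size, levels = c("short", "long"))
table(size)
```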

Additional Text Commands

NOT ON QUIZ 2:

Additional skills for Quiz 2 from HW 5 - Part 1:

  • summing across rows using sum(c_across(...))

  • using pivot_wider and then pivot_longer and then replacing NAs with 0 to create a ‘complete’ data set

    • Useful for area plots
  • Plotting skills for Quiz 2

    • unformatted line plot, area plot, or grouped bar chart

    • hchart in highcharter package
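The two data management skills above (row sums with `c_across` and building a complete dataset with the pivot functions) are sketched below on made-up tables. The object and column names are illustrative; the example assumes dplyr and tidyr are installed.

```r
library(dplyr)
library(tidyr)

# row sums across several columns
scores <- tibble(id = 1:2, q1 = c(1, 4), q2 = c(2, 5), q3 = c(3, 6))
scores_total <- scores |>
  rowwise() |>                               # work one row at a time
  mutate(total = sum(c_across(q1:q3))) |>
  ungroup()

# 2025 has no row for type B, so the long data are incomplete
sales <- tibble(year = c(2024, 2024, 2025),
                type = c("A", "B", "A"),
                amount = c(10, 20, 30))

sales_complete <- sales |>
  pivot_wider(names_from = type, values_from = amount) |>   # missing 2025/B becomes NA
  pivot_longer(cols = A:B, names_to = "type", values_to = "amount") |>
  mutate(amount = replace_na(amount, 0))                    # NA becomes 0
```

After these steps every year has a row for every type, which is what an area plot needs to stack correctly.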

Key Points from This Week

Introduction to Geographic Data

  • Use skills already covered to

    • clean data and check text variables

    • join datasets

    • create plot grids for comparing variables

  • Determine if variables need to be log-transformed.

  • Quiz 2 Practice Questions are available.

  • Quiz 2 will be on Tuesday, 4/7.

You may submit an ‘Engagement Question’ about each lecture until midnight on the day of the lecture. A minimum of four submissions is required during the semester.