Housekeeping

Upcoming Dates

  • HW 5 - Part 1 is due on Wed. 3/26 - Grace Period ends 3/28

  • Draft Proposals are due Thursday 3/27

  • Thursday 3/27: In-class work day and NO OFFICE HOURS

  • Tuesday 4/1: Review using practice questions

  • Quiz 2 is on Thursday, 4/3

    • Mostly on Weeks 5 through 9, but material is cumulative.

    • Mostly on HW Assignments 4 and 5, but material is cumulative.

    • Data Mgmt. tasks will require 2 or 3 steps, and then you will answer questions.

  • HW 5 - Part 2 will be posted after Quiz 2.

Today, Thursday, and Tuesday, 4/1

  • Practice Questions for Quiz 2 are posted.

  • Tue. 4/1: Skills/Concepts Review for Quiz 2

    • Putting skills together for different goals

    • Come with questions or submit them as Engagement Questions.

  • Today: Intro to Managing and Plotting Geographic Data

    • Today’s geographic maps will not be on Quiz 2.

    • Mapping data is an effective data curation tool that may be useful in this class, other classes, and your career.

  • We will cover more about geographic data and map visualizations after Quiz 2.

  • If you want help with mapping project data, please reach out to me or the TA.

In-class Exercise - Week 10

Purpose:

  • To gain some experience and understanding of map data available in R and elsewhere.

  • To experiment with mapping data

  • Students are encouraged to use domestic or international map data in their dashboards if appropriate.

Data Preparation

  • The data for today’s in-class exercise is part of R.

  • These geographic data are useful if you have information by state or county and you want to show a choropleth map of your data.

  • R also has world information, e.g., countries, continents, etc.

us_states <- map_data("state") |>               # state polygons (not used today)
  rename("state" = "region")
us_counties <- map_data("county") |>            # county polygons
  rename("state" = "region", "county" = "subregion") |>
  mutate(county = gsub(" ", "", county), 
         county = gsub("'","", county) |> tolower())
#unique(us_counties$county[us_counties$state=="louisiana"]) # note issue Louisiana counties
cnty2019_all <- county_2019
#unique(cnty2019_all$name[cnty2019_all$state=="Louisiana"]) # note issue Louisiana counties

cnty2019_all <- cnty2019_all |> 
  mutate(state = tolower(state),
         county = tolower(name),
         county = gsub(" county", "", county),
         county = gsub(" parish", "", county),
         county = gsub("\\.", "", county),          # \\ escapes the dot; an unescaped . matches any character in a regex
         county = gsub(" ", "", county),
         county = gsub("'","", county)) |>
  relocate(county, .before=name)

cnty2019_all <- full_join(us_counties, cnty2019_all,
                          by = c("state", "county"))  # geo data and demographic data
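
The join above only matches rows when the cleaned county names agree in both datasets. As a minimal sketch, the same clean-up steps can be wrapped in a helper and tried on a few names. The function name `clean_county` and the sample names are hypothetical and are not part of the lecture code.

```r
# Hypothetical helper: applies the clean-up steps shown above
# to a character vector of county names (base R only).
clean_county <- function(x) {
  x <- tolower(x)
  x <- gsub(" county", "", x)
  x <- gsub(" parish", "", x)
  x <- gsub("\\.", "", x)   # escaped dot: removes literal periods only
  x <- gsub(" ", "", x)     # remove spaces
  x <- gsub("'", "", x)     # remove apostrophes
  x
}

# Made-up examples of names that need cleaning before a join
cleaned <- clean_county(c("St. Mary Parish",
                          "Prince George's County",
                          "De Kalb County"))
cleaned
```

If two datasets still fail to match after this kind of cleaning, listing the unique names from each side (as in the commented-out `unique()` lines above) is a quick way to spot the remaining differences.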

County Demographic Plots

  • Creating a new dataset not required, but helpful

  • Plot code could be converted into a function

#|label: select data and plot 1 map
cnty_data1 <- cnty2019_all |> 
  select(long:county, hs_grad, bachelors, 
         household_has_computer, 
         household_has_broadband) 
cnty_hs_grad <- cnty_data1 |>
  ggplot(aes(x=long, y=lat, 
             group=group, fill=hs_grad)) +
    geom_polygon() +
    theme_map() +
    coord_map("albers", lat0 = 39, lat1 = 45) +
    labs(fill= "", title="Percent with High School Degree") +
    scale_fill_continuous(type = "viridis") + 
    theme(plot.background = element_rect(fill = "lightgrey", color = NA),
          legend.key.size = unit(.4, 'cm'),
          plot.title = element_text(size = 10),
          legend.text= element_text(size = 8))

Font in plot adjusted for screen.

#|label: plot 2 code
cnty_bachelors <- cnty_data1 |>
  ggplot(aes(x=long, y=lat, 
             group=group, 
             fill=bachelors)) +
    geom_polygon() +
    theme_map() +
    coord_map("albers", lat0 = 39, lat1 = 45) +
    labs(fill= "", title="Percent with Bachelor's Degree") +
    scale_fill_continuous(type = "viridis") + 
    theme(plot.background = element_rect(fill = "lightgrey", color = NA),
          legend.key.size = unit(.4, 'cm'),
          plot.title = element_text(size = 10),
          legend.text= element_text(size = 8))

Font in plot adjusted for screen.

#|label:  plot 3 code
cnty_computer <- cnty_data1 |>
  ggplot(aes(x=long, y=lat, 
             group=group, 
             fill=household_has_computer)) +
    geom_polygon() +
    theme_map() +
    coord_map("albers", lat0 = 39, lat1 = 45) +
    labs(fill= "", title="Percent of Households with a Computer")+
    scale_fill_continuous(type = "viridis") + 
    theme(plot.background = element_rect(fill = "lightgrey", color = NA),
          legend.key.size = unit(.4, 'cm'),
          plot.title = element_text(size = 10),
          legend.text= element_text(size = 8))

Font in plot adjusted for screen.

#|label:  plot 4 code
cnty_brdbd <- cnty_data1 |>
  ggplot(aes(x=long, y=lat, 
             group=group, 
             fill=household_has_broadband)) +
    geom_polygon() +
    theme_map() +
    coord_map("albers", lat0 = 39, lat1 = 45) +
    labs(fill= "", title="Percent of Households with Broadband") +
    scale_fill_continuous(type = "viridis") + 
    theme(plot.background = element_rect(fill = "lightgrey", color = NA),
          legend.key.size = unit(.4, 'cm'),
          plot.title = element_text(size = 10),
          legend.text= element_text(size = 8))

Font in plot adjusted for screen.
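
The four plots above differ only in the fill variable and the title, so the plot code could be converted into a function. The sketch below is one possible version, not the course's official solution. The function name `county_map`, the toy data, and the choice of `ggthemes` as the source of `theme_map()` are assumptions. Drawing the plot with `coord_map()` also requires the `mapproj` package.

```r
library(ggplot2)
library(ggthemes)   # assumed source of theme_map()

# Hypothetical helper: fill_var is a column name given as text
county_map <- function(data, fill_var, title) {
  ggplot(data, aes(x = long, y = lat, group = group,
                   fill = .data[[fill_var]])) +
    geom_polygon() +
    theme_map() +
    coord_map("albers", lat0 = 39, lat1 = 45) +
    labs(fill = "", title = title) +
    scale_fill_continuous(type = "viridis") +
    theme(plot.background = element_rect(fill = "lightgrey", color = NA),
          legend.key.size = unit(.4, "cm"),
          plot.title = element_text(size = 10),
          legend.text = element_text(size = 8))
}

# Toy data: two made-up square "counties" so the sketch is self-contained
toy <- data.frame(
  long    = c(-100, -99, -99, -100, -98, -97, -97, -98),
  lat     = c(40, 40, 41, 41, 40, 40, 41, 41),
  group   = rep(c(1, 2), each = 4),
  hs_grad = rep(c(85, 92), each = 4)
)

toy_plot <- county_map(toy, "hs_grad", "Percent with High School Degree")
```

With the lecture data, a call such as `county_map(cnty_data1, "bachelors", "Percent with Bachelor's Degree")` should reproduce the second plot.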

Demographic Plot Grid - Order Matters

Grid is populated left to right and then top to bottom unless otherwise specified.

#|label: grid of pct plots

# note alternative to grid.arrange used
# code shown is for export version of plots
grid <- plot_grid(cnty_hs_grad, cnty_computer, cnty_bachelors, cnty_brdbd, ncol=2)

In-class Exercise - Plot Grid (Steps 1 & 2)

Steps to Follow

  1. Examine the available variables in the cnty2019_all dataset created in the Data Preparation code.

  2. Create Individual Plots. Variable names in R dataset and definitions:

    • household_has_smartphone: Households with Smart Phones
    • median_age: Median Age
    • median_household_income: Median Household Income
    • median_individual_income: Median Individual Income
  • You can copy provided plot code and modify it for these variables or you could try to create a function (not required today).
#|label: usmaps demographic maps exercise
# select data for plot (not required, but helpful)
# cnty_data2 <- 

# create four individual plots, one for each variable
# use provided plot code and modify

In-class Exercise - Plot Grid (Steps 3 & 4)

  3. Create a 2x2 plot grid of these four variables by US County as part of your class participation credit for this week.
  • For full credit, plots must be in the order specified.

    • Row 1: should have smartphone variable and median age.
    • Row 2: median household and individual income.
  • Create Plot Grid using plot_grid command.

#|label: in-class exercise 2x2 plot grid
# for full credit grid of plots must be in order specified
# use plot_grid command
# export plots to img folder using save_plot 
  4. Right-click on the plot grid, then choose Save as… and save it to the img folder with the correct name.
  • or use save_plot command
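
As a rough sketch of how `plot_grid` and `save_plot` from the `cowplot` package fit together, the example below uses four placeholder scatterplots built from the built-in `mtcars` data, not the exercise maps. The object names and the output file name are hypothetical.

```r
library(ggplot2)
library(cowplot)

# Four placeholder plots (stand-ins for the four county maps)
p1 <- ggplot(mtcars, aes(wt, mpg)) + geom_point()
p2 <- ggplot(mtcars, aes(hp, mpg)) + geom_point()
p3 <- ggplot(mtcars, aes(disp, mpg)) + geom_point()
p4 <- ggplot(mtcars, aes(qsec, mpg)) + geom_point()

# Grid fills left to right, then top to bottom: row 1 = p1, p2; row 2 = p3, p4
demo_grid <- plot_grid(p1, p2, p3, p4, ncol = 2)

# Written to a temporary folder here; in the exercise the path would point to img/
out_file <- file.path(tempdir(), "demo_grid.png")
save_plot(out_file, demo_grid, ncol = 2, nrow = 2)
```

Setting `ncol` and `nrow` in `save_plot` scales the saved image to the number of panels, so each panel keeps a readable size.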

Mapping Log Transformed Data

  • The previous examples are all percent data.

  • No transformations are needed

  • In contrast, population data or financial data are often right-skewed and need to be log transformed.

  • Recall from MAS 261, BUA 345 (and perhaps FIN classes):

    • An effective transformation for right skewed data is the natural log (LN) transformation.

    • The following demo shows how useful it is for mapping right skewed data.
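
To see why the natural log helps, here is a small base R sketch using simulated right-skewed values. The data are made up with `rlnorm` and are not the county data. In right-skewed data the mean is pulled above the median; after the log transformation the two are close.

```r
set.seed(455)

# Simulated right-skewed values (an assumption for illustration only)
x <- rlnorm(1000, meanlog = 3, sdlog = 1)
ln_x <- log(x)   # in R, log() is the natural log (LN)

# Before the transformation: mean is well above the median
raw_summary <- c(mean = mean(x), median = median(x))

# After the transformation: mean and median are nearly equal
log_summary <- c(mean = mean(ln_x), median = median(ln_x))

raw_summary
log_summary
```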

Plot of Skewed Data

#|label: untransformed pop. map code

cnty_data3 <- cnty2019_all |> 
  select(long:county, pop) |>
  mutate(pop1k = pop/1000)
       
cnty_pop <- cnty_data3 |>
    ggplot(aes(x=long, y=lat, group=group, fill=pop1k)) +
    geom_polygon() +
    theme_map() +
    coord_map("albers", lat0 = 39, lat1 = 45) +
    labs(fill= "", title="Population by County",
         subtitle="Unit is 1000 People") +
    scale_fill_continuous(type = "viridis") +
    theme(legend.key.width = unit(.4, "cm"))

Histogram Clarifies Data Skewness

#|label: untransformed pop hist code
hist_pop <- cnty_data3 |>
  ggplot() +
  geom_histogram(aes(x=pop1k),
                 fill="lightblue", 
                 col="darkblue") +
  labs(x="Population", title="Histogram of US Population Data",
       y = "Count",subtitle="Unit is 1000 People") +
  theme_classic()

The Problem: The data are clearly skewed, BUT presenting log-transformed data in a map may complicate data interpretation.

Solution - Natural Log Transformed Plot

  • The data are not transformed, but the fill scale and its legend are

  • Options specified in scale_fill_continuous:

    • trans = "log"
    • breaks = c(...)
  • break intervals determined by examining data
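
One way to examine the data when choosing breaks is to look at the range and pick powers of 10 that cover it. The sketch below uses made-up values in units of 1000 people. The vector `pop1k_demo` is hypothetical and is not the county data.

```r
# Made-up population values in thousands (illustration only)
pop1k_demo <- c(0.4, 2.5, 18, 95, 640, 9800)

rng <- range(pop1k_demo, na.rm = TRUE)

# Powers of 10 that span the smallest and largest values
low_power  <- floor(log10(rng[1]))
high_power <- ceiling(log10(rng[2]))
demo_breaks <- 10^(low_power:high_power)

demo_breaks   # candidate values for breaks = c(...)
```

Breaks below the smallest value or above the largest can be dropped if they clutter the legend.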

#|label: log transformed map plot code
cnty_lpop <- cnty_data3 |>
  ggplot(aes(x=long, y=lat, group=group, fill=pop1k)) +
    geom_polygon() +
    theme_map() +
    coord_map("albers", lat0 = 39, lat1 = 45) +
    labs(fill= "", title="Population by County",
         subtitle="Unit is 1000 People and Data are Log-transformed") +
    scale_fill_continuous(type = "viridis",trans="log",
                          breaks=c(1,10,100,1000,10000)) +
    theme(legend.key.width = unit(.4, "cm"))

Histogram of Log transformed Data

  • Final dashboard doesn’t include exploratory plots
  • Data exploration plots:
    • Histograms, scatterplots, and boxplots are all useful
#|label: log transformed pop hist code
hist_lpop <- cnty_data3 |>
  ggplot() +
  geom_histogram(aes(x=pop1k),fill="lightblue", col="darkblue") +
  labs(x="Population", 
       title="Histogram of Natural Log of US Population Data",
       y = "Count",
       subtitle="Unit is 1000 People and Data are Log-transformed") +
  theme_classic() + 
  scale_x_continuous(trans="log", breaks=c(1,10,100,1000,10000))

Histogram of Log-transformed Data

Population Plot Grid

#|label: plot code for pop grid
grid.arrange(cnty_pop, hist_pop,cnty_lpop, hist_lpop, ncol=2) 

When and How to Log Transform

  • Log transformations are useful if you have right-skewed POSITIVE data such as

    • Prices
    • Population
    • Sales
    • Income
  • Note: If data (x) have zeros, a good option is to use log(x + 1)

    • ln(1) = 0 (in R, log(1) = 0)

    • 0 values in the data remain 0 after the transformation

  • In the following example we will create plots for number of households by county:

    • Histograms with and without LN transformation

    • Map plots with and without LN transformation
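
The note about zeros above can be checked with a short base R sketch. The values are made up. `log1p(x)` is a built-in base R function that computes log(x + 1).

```r
# Made-up values that include a zero
x_zero <- c(0, 1, 9, 99)

plain_log <- log(x_zero)        # log(0) is -Inf, which cannot be plotted
shift_log <- log(x_zero + 1)    # zeros map to log(1) = 0
builtin   <- log1p(x_zero)      # same result as log(x + 1)

plain_log
shift_log
```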

Number of Households Per County

Without Transformation

Untransformed Data Histogram

#|label: data and untransformed hh data histogram code
cnty_data4 <- cnty2019_all |> 
  select(long:county, households) |>
  mutate(households1K = households/1000)

hist_hholds <- cnty_data4 |>
  ggplot() +
  geom_histogram(aes(x=households1K),
                 fill="lightblue", 
                 col="darkblue") +
  labs(x="Number of Households", title="Histogram of US Household Data",
       y = "Count",subtitle="Unit is 1000 Households") +
  theme_classic()

Number of Households is highly right-skewed.

Number of Households Per County

With Transformation

Log transformed Data Histogram

#|label: log transformed hh data histogram code
hist_lhholds <- cnty_data4 |>
  ggplot() +
  geom_histogram(aes(x=households1K),
                 fill="lightblue", 
                 col="darkblue") +
  labs(x="Number of Households", 
       title="Histogram of Natural Log of US Household Data",
       y = "Count",
       subtitle="Unit is 1000 Households and Data are Log-transformed") +
  theme_classic() + 
  scale_x_continuous(trans="log", breaks=c(1,10,100,1000))

Log-transformed data appear approximately normally distributed.

Number of Households Per County

cnty_hholds <- cnty_data4 |>
  ggplot(aes(x=long, y=lat, group=group, fill=households1K)) +
  geom_polygon() +
  theme_map() +
  coord_map("albers", lat0 = 39, lat1 = 45) +
  labs(fill= "", title="Number of Households by County",
       subtitle="Unit is 1000 Households") +
  scale_fill_continuous(type = "viridis") +
  theme(legend.key.width = unit(.4, "cm"))

Map of untransformed households per county data is uninformative.

In-class Exercise 2

The log-transformed data map is more informative about geographic variability.

Submit R code to create log transformed households map in a text (.txt) file with your name.

Households per County Plot Grid

grid.arrange(hist_hholds, hist_lhholds, cnty_hholds, cnty_lhholds, ncol=2)

Quiz 2 Information

  • Questions and Material from Quiz 1 may be on Quiz 2

  • Practice Questions for Quiz 2 are posted.

    • Review Quiz 1 and Quiz 1 Practice Questions

    • Review HW 4 and HW 5 - Part 1 and recent lectures

  • Study Tip: Feel free to add on to practice questions .qmd file with extra chunks and notes so that all of your notes are in one place.

Quiz 2 Information Cont’d

  • Converting text (character) date information to a date using lubridate commands (Week 5)

    • Example R commands: ymd, dmy, mdy, ym combined with paste to combine columns
  • Extracting year, month, or day from the date variable using lubridate commands (Week 6)

    • Example R commands: year, month, quarter, wday, day
  • Converting an xts dataset to a tibble (standard R dataset) (Week 7)

    • Creating a lineplot from time series (non-xts) dataset
  • Converting a tibble to an xts dataset

    • Creating an interactive hchart
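
The lubridate items in the list above can be sketched as follows. The date values are made up for illustration, and the xts steps are not covered here.

```r
library(lubridate)

# Converting text to dates; the function name matches the order of the parts
d_ymd <- ymd("2025-03-27")
d_dmy <- dmy("27-03-2025")
d_ym  <- ym("2025-03")          # first day of the month

# Combining separate columns with paste before converting
d_mdy <- mdy(paste(3, 27, 2025, sep = "/"))

# Extracting parts of a date
parts <- c(year    = year(d_ymd),
           month   = month(d_ymd),
           quarter = quarter(d_ymd),
           day     = day(d_ymd),
           wday    = wday(d_ymd))   # default: 1 = Sunday, so Thursday = 5
parts
```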

Quiz 2 Info Cont’d

  • You should be familiar with the bls_tidy function we created and how to use it to import similar datasets.

  • There will be datasets to be imported AND combined (joined)

    • Data sets can be joined by matching rows on shared key columns.

    • You should know how to do the different joins we covered and what each one does:

      • full_join

      • right_join

      • left_join

      • inner_join

  • Data sets can be stacked (rows appended) if the columns are identical

    • In BUA 455 we covered bind_rows
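
To review what each join keeps, here is a minimal sketch with two tiny made-up datasets that share an `id` column. The datasets `a` and `b` are hypothetical.

```r
library(dplyr)

a <- tibble(id = c(1, 2, 3), x = c("a1", "a2", "a3"))
b <- tibble(id = c(2, 3, 4), y = c("b2", "b3", "b4"))

n_inner <- nrow(inner_join(a, b, by = "id"))  # ids in both: 2, 3
n_left  <- nrow(left_join(a, b, by = "id"))   # all ids in a: 1, 2, 3
n_right <- nrow(right_join(a, b, by = "id"))  # all ids in b: 2, 3, 4
n_full  <- nrow(full_join(a, b, by = "id"))   # all ids in either: 1, 2, 3, 4

# Stacking: bind_rows appends rows when the columns match
stacked <- bind_rows(a, tibble(id = 5, x = "a5"))

c(inner = n_inner, left = n_left, right = n_right, full = n_full)
```

Rows without a match get NA in the columns that come from the other dataset.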

Quiz 2 Info Cont’d

  • Cleaning messy data (Week 5 and Weeks 8-9)

    • Dealing with text (character variables)

      • gsub

      • separate

      • unite or paste or paste0

      • ifelse can be used for text OR for numeric data

      • ifelse followed by factor allows you to make any categorical variable you want.

  • Other commands for modifying text:

    • tolower and toupper

    • str_trim, str_squish and str_pad
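
Several of the text commands listed above can be combined in a short sketch. The city names and the region labels are made up for illustration. `separate` and `unite` are not shown here.

```r
library(stringr)

raw <- c("  New   York ", "los angeles", "CHICAGO  ")

squished <- str_squish(raw)            # trims ends and collapses repeated spaces
lowered  <- tolower(squished)
no_space <- gsub(" ", "_", lowered)    # replace spaces with underscores

# ifelse followed by factor creates a categorical variable
region   <- ifelse(lowered == "chicago", "Midwest", "Coastal")
region_f <- factor(region, levels = c("Coastal", "Midwest"))

trimmed <- str_trim("  x  ")                               # removes leading/trailing spaces
padded  <- str_pad("7", width = 3, side = "left", pad = "0")
labeled <- paste0("city_", no_space)                       # paste0 joins text with no separator
```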

Additional Text Commands

Additional skills for Quiz 2 from HW 5 - Part 1:

  • summing across rows using sum(c_across(...))

  • using pivot_wider and then pivot_longer and then replacing NAs with 0 to create a ‘complete’ data set

    • Useful for area plots
  • Plotting skills for Quiz 2

    • unformatted line plot, area plot, or grouped bar chart

    • hchart in highcharter package
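
The two data management skills from HW 5 - Part 1 can be sketched with small made-up datasets. The names `sales` and `counts` and their columns are hypothetical.

```r
library(dplyr)
library(tidyr)

# Summing across rows with sum(c_across(...))
sales <- tibble(store = c("A", "B"),
                q1 = c(10, 5), q2 = c(20, 0), q3 = c(30, 15))

sales_total <- sales |>
  rowwise() |>
  mutate(total = sum(c_across(q1:q3))) |>
  ungroup()

# Creating a 'complete' dataset: the 2021 / type y combination is missing
counts <- tibble(year = c(2020, 2020, 2021),
                 type = c("x", "y", "x"),
                 n    = c(3, 4, 5))

counts_complete <- counts |>
  pivot_wider(names_from = type, values_from = n) |>      # missing cell becomes NA
  mutate(across(c(x, y), ~ replace_na(.x, 0))) |>          # replace NA with 0
  pivot_longer(cols = c(x, y), names_to = "type", values_to = "n")
```

After these steps every year has a row for every type, which is what an area plot needs.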

NOT ON QUIZ 2: