Week 12

More on Geographic Data, Project Management, and Publishing a Dashboard

Author

Penelope Pooler Eisenbies

Published

April 7, 2025

Housekeeping

Final Proposals and Group Projects
- Your participation in the final proposal and plan will affect your final grade on the project.
- Group members who don’t contribute will not get credit for the work done by others in the group.
- If you have data management questions, reach out to me or a course TA.
- We are here to help with tasks where you might be stymied, but don’t wait until the last day.
Presentations will be on 4/28.
- Only presentation dashboards are due on 4/28.
- All students are required to attend and provide feedback.

More Housekeeping

HW 5 - Part 2 is now posted and I am updating the demo videos because this assignment will be done on Posit Cloud.
- There is a 2-day grace period, if needed.
- This is a short assignment that covers some final essential skills for your dashboard project.
Quiz 2 grading is almost done. There are a couple of students taking a make-up this evening.

Plans for this week

Two lectures on Geographic Data have been streamlined so that students also have time for group work this week.
I will not cover them all in detail.
Rather than deleting notes and code that might be useful to some students, all notes are provided.

Topics Covered

Geographic Data: world data, state data, and filtering map data to a region
Publishing work: More tips for good project management
- Posting HTML files for free using RPubs
  - Note: RPubs is the recommended option for presenting and submitting your dashboard
HW 5 - Part 2 Demo on Posit Cloud
There will be time to work on projects in class this week and next week.

Importing and Joining World Datasets

World Data

Code

```{r world data prep}
world <- map_data("world") |> select(!subregion) |>     # world geo info
  mutate(region=ifelse(region=="UK", "United Kingdom", region))
  
intbxo <- read_csv("data/intl_bxo.csv", show_col_types = F, skip=7) |>      # import/tidy bxo
  select(1,6) |>
  rename("region" = "Area", "wknd_gross" = "Weekend Gross") |>
  filter(!is.na(wknd_gross)) |>
  mutate(wknd_gross = gsub("$", "", wknd_gross, fixed = T),
         wknd_gross = gsub(",", "", wknd_gross, fixed = T) |> as.numeric())
```

Warning: There was 1 warning in `mutate()`.
ℹ In argument: `wknd_gross = as.numeric(gsub(",", "", wknd_gross, fixed = T))`.
Caused by warning:
! NAs introduced by coercion
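The coercion warning above means at least one `wknd_gross` value was still not a plain number after the `$` and `,` were removed. A minimal sketch for finding such values follows. It uses made-up strings rather than the actual box office file, and the `raw_gross` vector and its "-" placeholder are assumptions for illustration only.

```r
# Hypothetical raw values; the "-" entry stands in for whatever non-numeric
# text appears in the real file.
raw_gross <- c("$19,000,000", "$4,250", "-", "$310")

# Same cleaning steps as the chunk above: strip "$" and "," as literal text.
cleaned <- gsub("$", "", raw_gross, fixed = TRUE)
cleaned <- gsub(",", "", cleaned, fixed = TRUE)

# Convert to numeric, silencing the expected coercion warning.
gross_num <- suppressWarnings(as.numeric(cleaned))

# Entries that became NA are the ones that triggered the warning.
bad_values <- raw_gross[is.na(gross_num)]
print(bad_values)
```

Checking the `NA` entries this way helps confirm that the warning comes from placeholder text rather than from real values that were lost.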

Code

```{r world data prep}
world_bxo_data <- full_join(intbxo, world) |>                                # join datasets
  filter(!is.na(wknd_gross))
```

Joining with `by = join_by(region)`
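The message above shows that dplyr picked `region` as the join key automatically. The sketch below names the key explicitly and uses `anti_join()` to list box office regions that have no match in the map data. The `bxo_toy` and `map_toy` tibbles and their values are hypothetical stand-ins, not the course data.

```r
library(dplyr)

# Toy stand-ins for intbxo and world (hypothetical values).
bxo_toy <- tibble(region = c("United Kingdom", "Japan", "Middle East Other"),
                  wknd_gross = c(19000000, 5200000, 800000))
map_toy <- tibble(region = c("United Kingdom", "Japan", "France"),
                  long = c(-1.07, 139.7, 2.35),
                  lat  = c(50.7, 35.7, 48.9))

# Naming the key explicitly silences the "Joining with" message.
joined <- full_join(bxo_toy, map_toy, by = "region")

# Rows of bxo_toy with no matching region in map_toy.
unmatched <- anti_join(bxo_toy, map_toy, by = "region")
print(unmatched$region)
```

Unmatched names are usually spelling differences (as with "UK" vs. "United Kingdom") or multi-country areas that have no polygon of their own.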

Code

```{r world data prep}
world_bxo_data$continent = countrycode(sourcevar = world_bxo_data$region,    # retrieve continents
                                       origin = "country.name",
                                       destination = "continent")  
```

Warning: Some values were not matched unambiguously: Central America, Middle East Other, Serbia and Montenegro
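The warning above lists region names that `countrycode()` could not assign to a continent, so those rows get `NA`. One possible way to handle a specific name is the `custom_match` argument, sketched below. Assigning "Europe" to "Serbia and Montenegro" is a judgment call made for illustration. Multi-country areas such as "Central America" may be better left as `NA`.

```r
library(countrycode)

# Two region names: one that countrycode recognizes and one from the warning.
regions <- c("Japan", "Serbia and Montenegro")

# custom_match is a named vector: names are source values, values are the
# destination codes to use for them. "Europe" here is an assumption.
continents <- countrycode(sourcevar = regions,
                          origin = "country.name",
                          destination = "continent",
                          custom_match = c("Serbia and Montenegro" = "Europe"))
print(continents)
```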

Code

```{r world data prep}
head(world_bxo_data, 3)
```

# A tibble: 3 × 7
  region         wknd_gross  long   lat group order continent
  <chr>               <dbl> <dbl> <dbl> <dbl> <int> <chr>    
1 United Kingdom   19000000 -1.07  50.7   570 40057 Europe   
2 United Kingdom   19000000 -1.15  50.7   570 40058 Europe   
3 United Kingdom   19000000 -1.18  50.6   570 40059 Europe

Choropleth Country Plot w/ Labels

Example - Asia

Most of the plot code that follows is review
- There are a few new details:
  - shadowtext labels (see below)
  - modifying size of text elements (mentioned but not emphasized)
NOTES:
- The R package shadowtext includes the command geom_shadowtext
- shadowtext is useful for creating visible labels for all countries regardless of map fill color
- Deciding on units ($1000) and transformation (log) took some trial and error.

Managing Data for Asia Choropleth Map

This R code creates the Asia Map dataset.

Code

```{r asia data for map}
asia_bxo_data <- world_bxo_data |>           # create asia box office dataset 
  filter(continent=="Asia") |>
  mutate(Gross = as.integer(wknd_gross), 
         wknd_gross = wknd_gross/1000) 

asia_nms <- asia_bxo_data |>                         # create dataset of country names 
  select(region, long, lat, group, continent) |>     # median lat and long 
                                                     # used for label positions
  group_by(continent, region) |>
  summarize(nm_x=median(long, na.rm=T),
            nm_y=median(lat, na.rm=T)) |>
  filter(!is.na(nm_x) | !is.na(nm_y))
```

`summarise()` has grouped output by 'continent'. You can override using the
`.groups` argument.
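The message above is informational: after `summarize()`, the result is still grouped by `continent`. A small sketch of the `.groups` argument follows, using a toy dataset (`toy_map`, with hypothetical coordinates) in place of the Asia data.

```r
library(dplyr)

# Toy polygon vertices for two regions (hypothetical coordinates).
toy_map <- tibble(continent = "Asia",
                  region = c("A", "A", "A", "B", "B", "B"),
                  long = c(100, 102, 104, 120, 121, 125),
                  lat  = c(10, 12, 11, 30, 33, 31))

# .groups = "drop" returns an ungrouped tibble and suppresses the message.
toy_nms <- toy_map |>
  group_by(continent, region) |>
  summarize(nm_x = median(long, na.rm = TRUE),
            nm_y = median(lat, na.rm = TRUE),
            .groups = "drop")
print(toy_nms)
```

Dropping the grouping is optional here, but it avoids surprises if later steps are not meant to run by group.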

Code

```{r asia data for map}
asia_bxo_data <- full_join(asia_bxo_data, asia_nms) # merge datasets using a full_join
```

Joining with `by = join_by(region, continent)`

R code for Asia Choropleth Map

Data are shown on log scale to improve interpretability.

Code

```{r asia static map code}
asia_bxo_map <- asia_bxo_data |>    # Creates the map that follows
   ggplot(aes(x=long, y=lat, group=group, fill=wknd_gross)) +
   geom_polygon(color="darkgrey") +
   theme_map() +
   coord_map("albers", lat0 = 39, lat1 = 45) +
   labs(fill= "Gross ($1K)",
        title="Weekend Gross ($ Thousands) in Asian Countries",
        subtitle="Weekend Data Updated 4/7/25 - Data are Log-transformed",
        caption="Data Source: https://www.boxofficemojo.com") +
    
   scale_fill_continuous(type = "viridis",  trans="log",
                         breaks =c(1,10,100,1000,10000)) +
   geom_shadowtext(aes(x=nm_x, y=nm_y,label=region),
                   color="white",check_overlap = T,
                   show.legend = F, size=4) + 
                   
   theme(plot.title = element_text(size = 20),
         plot.subtitle = element_text(size = 15),
         plot.caption = element_text(size = 10),
         legend.text = element_text(size = 12),
         legend.title = element_text(size = 15),
         plot.background = element_rect(colour = "darkgrey", fill=NA, linewidth=2)) 
```
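The breaks in the map code are powers of ten. The short sketch below illustrates why that choice gives evenly spaced legend labels when `trans="log"` (natural log) is used. It only uses the break values from the code above.

```r
# Breaks used in scale_fill_continuous() in the map code.
brks <- c(1, 10, 100, 1000, 10000)

# trans = "log" applies the natural log before mapping values to colors.
log_brks <- log(brks)

# Gaps between consecutive breaks on the log scale are all equal to log(10).
gaps <- diff(log_brks)
print(round(gaps, 4))
```

Because each break is ten times the previous one, the labels land at equal distances along the color bar.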

Asia Map with Log (LN) Transformation

Europe Map Data

Creates data for Europe Map

Code

```{r europe data for map}
euro_bxo_data <- world_bxo_data |>           # create Europe box office dataset 
  filter(continent=="Europe" & region != "Russia") |>
  mutate(Gross = as.integer(wknd_gross), 
         wknd_gross = wknd_gross/1000) 

euro_nms <- euro_bxo_data |>                         # create dataset of country names 
  select(region, long, lat, group, continent) |>     # median lat and long used for position
  group_by(continent, region) |>
  summarize(nm_x=median(long, na.rm=T),
            nm_y=median(lat, na.rm=T)) |>
  filter(!is.na(nm_x) | !is.na(nm_y))
```

`summarise()` has grouped output by 'continent'. You can override using the
`.groups` argument.

Code

```{r europe data for map}
euro_bxo_data <- full_join(euro_bxo_data, euro_nms) # merge datasets using a full_join
```

Joining with `by = join_by(region, continent)`

R code for Europe Choropleth Map

Data are shown on log scale to improve interpretability.

Code

```{r europe static map code}
euro_bxo_map <- euro_bxo_data |>
   ggplot(aes(x=long, y=lat,
              group=group,
              fill=wknd_gross)) +
   geom_polygon(color="darkgrey") +
   theme_map() +
   coord_map("albers", lat0 = 39, lat1 = 45) +
   labs(fill= "Gross ($1K)",
        title="Weekend Gross ($ Thousands) in European Countries",
        subtitle="Weekend Ending 11/10/24 - Data are Log-transformed",
        caption="Data Source: https://www.boxofficemojo.com") +
    
   scale_fill_continuous(type = "viridis",  trans="log",
                         breaks =c(1,10,100,1000,10000)) +
   geom_shadowtext(aes(x=nm_x, y=nm_y,label=region),
                   color="white",check_overlap = T,
                   show.legend = F, size=4) + 
                   
   theme(plot.title = element_text(size = 20),
         plot.subtitle = element_text(size = 15),
         plot.caption = element_text(size = 10),
         legend.text = element_text(size = 12),
         legend.title = element_text(size = 15))
```

Europe Map with Log (LN) Transformation

Week 12 In-class Exercises - Q1-Q3

Session ID: bua455f24

Question 1. What option is used in geom_polygon() to create the outlines of each country?

Question 2. How many different geometries (geom_...) are used to create these multi-layer maps?

Question 3. When using multiple geometry layers, where do you place the aesthetic (aes) so that it will apply to all of the geometries (all of the map layers)?

US State Data Example

Examples of Data that can be plotted by state
- Average costs and expenditures by state of specific goods or services
- Demographic data
- Voting and tax information
- Sports/Arts/Entertainment/Education investments and expenditures
Will also show a map of data filtered by region

US State Map Data

Code

```{r combine state polygons with state population data from R}
us_states <- map_data("state") |>          # state polygons (from R)
  select(long:region) |>
  rename("state" = "region")

state_abbr <- state_stats |>               # many useful variables in this dataset
  select(state, abbr) |>
  mutate(state = tolower(state))

state_pop <- county_2019 |>               # data by county (aggregated by state)
  select(state, pop) |>
  mutate(state=tolower(state),
         popM = pop/1000000) |>
  group_by(state) |>
  summarize(st_popM = sum(popM, na.rm=T)) |>
  full_join(state_abbr)
```

Joining with `by = join_by(state)`

Code

```{r combine state polygons with state population data from R}
statepop_map <- left_join(us_states, state_pop) # used left join to filter to lower 48 states
```

Joining with `by = join_by(state)`

Code

```{r combine state polygons with state population data from R}
# lat/long not available for HI and AK
```
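The comment in the chunk above says the left join filters the data to the lower 48 states. A toy sketch of that behavior follows, assuming polygons exist for only some states. The `polys_toy` and `pop_toy` tibbles are hypothetical, and most of the values are made up.

```r
library(dplyr)

# Toy stand-ins: polygons exist for two states, population for three.
polys_toy <- tibble(state = c("ohio", "ohio", "utah"),
                    long = c(-83, -82, -111),
                    lat  = c(40, 41, 39))
pop_toy <- tibble(state = c("ohio", "utah", "hawaii"),
                  st_popM = c(11.7, 3.2, 1.42))

# left_join keeps every polygon row and drops states without polygons.
left_result <- left_join(polys_toy, pop_toy, by = "state")

# full_join would keep hawaii, with missing long/lat.
full_result <- full_join(polys_toy, pop_toy, by = "state")

print(unique(left_result$state))
print(nrow(full_result))
```

Putting the polygon data on the left side means the result can only contain states that can actually be drawn.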

Adding State Midpoint (centroid) Lat and Long

In the previous maps (by country), country labels were added to the static map using the median latitude and longitude of each country’s polygon points.
Medians don’t work well for the U.S. because many states are oddly shaped and small.
Alternative: use centroid for each state polygon
- Centroid is another term for midpoint
- Saved data as .csv file named state_coords.csv (included)
  - Data did not include D.C. but those coordinates were found elsewhere
  - D.C. data is appended to other states using bind_rows
  - state_coords (centroids) were joined with state demographics data, statepop_map.
Final dataset for plot created: statepop_map
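To illustrate why a centroid can differ from the median of the boundary points, here is a sketch on a simple L-shaped polygon. It treats the coordinates as flat (planar) values, which is only an approximation for latitude and longitude. The centroids in `state_coords.csv` were looked up, not computed this way, and `polygon_centroid()` is a hypothetical helper written for this example.

```r
# Vertices of an L-shaped polygon, listed in order around the boundary.
x <- c(0, 4, 4, 1, 1, 0)
y <- c(0, 0, 1, 1, 4, 4)

# Label position used for the country maps: median of the vertices.
median_pos <- c(median(x), median(y))

# Area-weighted centroid via the shoelace formula (hypothetical helper).
polygon_centroid <- function(x, y) {
  x_next <- c(x[-1], x[1])          # each vertex paired with the next one
  y_next <- c(y[-1], y[1])
  cross  <- x * y_next - x_next * y
  area   <- sum(cross) / 2          # signed area of the polygon
  c(sum((x + x_next) * cross), sum((y + y_next) * cross)) / (6 * area)
}
centroid_pos <- polygon_centroid(x, y)

print(median_pos)
print(round(centroid_pos, 3))
```

The median depends on how many boundary points fall where, while the centroid depends on the area of the shape, so the two can give noticeably different label positions.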

Code for Adding Centroids to Data

Code

```{r add lat and long of state midpoints (centroid)}
state_coords <- read_csv("data/state_coords.csv", show_col_types = F,
                         col_names = c("state", "m_lat", "m_long")) |>
  mutate(state = gsub(", USA", "", state, fixed=T),
         state = gsub(", the USA", "", state, fixed=T),
         state = gsub(", the US", "", state, fixed=T),
         state = tolower(state))

state <- "district of columbia"        # save values for dc
m_lat <- 38.9072
m_long <- -77.0369
dc <- tibble(state, m_lat, m_long)     # create dataset of dc data ( 1 obs)
state_coords <- bind_rows(state_coords, dc) # add dc to state_coords

rm(dc, state, m_lat, m_long)           # remove temporary values from global
statepop_map <- left_join(statepop_map, state_coords) # centroids to data
```

Joining with `by = join_by(state)`
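The D.C. row above is built from temporary objects that are then removed. An alternative sketch, assuming the same three column names, uses `add_row()` from the tibble package to append the observation directly. The `coords_toy` values for Maryland and Virginia are hypothetical.

```r
library(tibble)

# Toy stand-in for state_coords (hypothetical centroid values).
coords_toy <- tibble(state = c("maryland", "virginia"),
                     m_lat = c(39.05, 37.43),
                     m_long = c(-76.64, -78.66))

# add_row() appends one observation without creating temporary objects.
coords_toy <- coords_toy |>
  add_row(state = "district of columbia", m_lat = 38.9072, m_long = -77.0369)
print(coords_toy)
```

Both approaches give the same result; this one just leaves nothing to clean up with `rm()`.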

State Population Plot

Similar to previous plots with a few changes
- Added borders to states by adding color="darkgrey" to geom_polygon command.
- Used State abbreviations for state labels.
- Adjusted the size of the state text labels (size = 4 in the code below)
- Changed breaks for log-scaled population legend
These details seem minor but they take time and trial and error.

R Code for US State Pop. Map (no transformation)

Code

```{r code for us states pop map no transformation}
st_pop <- statepop_map |>
    ggplot(aes(x=long, y=lat, group=group, fill=st_popM)) +
    geom_polygon(color="darkgrey") +
    theme_map() +
    coord_map("albers", lat0 = 39, lat1 = 45) +
    scale_fill_continuous(type = "viridis") +
    geom_shadowtext(aes(x=m_long, y=m_lat, label=abbr),
                    color="white", check_overlap = T,
                    show.legend = F, size=4) + 
    labs(fill= "Pop. in Millions", title="Population by State",
         subtitle="Unit is 1 Million People",
         caption= "Not Shown: HI: 1.42 Million   AK: 0.74 Million
         Data Source: https://CRAN.R-project.org/package=usdata") +
    theme(legend.position = "bottom",
          legend.key.width = unit(1, "cm"),
          plot.title = element_text(size = 20),
          plot.subtitle = element_text(size = 15),
          plot.caption = element_text(size = 15),
          legend.text = element_text(size = 15),
          legend.title = element_text(size = 15))  
```

US State Pop. Map (no transformation)

R Code for US State Pop. Map (log transformed)

Code

```{r code for us states pop map with log transformation}
st_lpop <- statepop_map |>
    ggplot(aes(x=long, y=lat, group=group, fill=st_popM)) +
    geom_polygon(color="darkgrey") +
    theme_map() +
    coord_map("albers", lat0 = 39, lat1 = 45) +
    scale_fill_continuous(type = "viridis", trans="log",
                          breaks=c(1,2,3,5,10,20,35))  +   # 0 omitted: log(0) is undefined
    geom_shadowtext(aes(x=m_long, y=m_lat, label=abbr),
                    color="white", check_overlap = T,
                    show.legend = F, size=4) + 
    labs(fill= "Pop. in Millions", title="Population by State",
         subtitle="Unit is 1 Million People - Log Transformed",
         caption= "Not Shown: HI: 1.42 Million   AK: 0.74 Million
         Data Source: https://CRAN.R-project.org/package=usdata") +
    theme(legend.position = "bottom",
          legend.key.width = unit(1, "cm"),
          plot.title = element_text(size = 20),
          plot.subtitle = element_text(size = 15),
          plot.caption = element_text(size = 15),
          legend.text = element_text(size = 15),
          legend.title = element_text(size = 15))  
```

US State Pop. Map (log transformed)

To log or not to log

In this course, we visualize data using ggplot, hchart, and dygraph.

If you want to explore (but not present) data, you can also use base graphics for quick plots.

Base graphics could also be used to make polished visualizations, but the code is much longer and more tedious than ggplot.

Code

```{r base graphics plots, fig.dim=c(5, 6), fig.align='center', out.extra='style="background-color: #3D3D3D; padding:1px;"'}
par(mfrow=c(2,1)) # stacks base graph plots
hist(statepop_map$st_popM, main="")
hist(log(statepop_map$st_popM), main="")
par(mfrow=c(1,1)) # resets base graph options
```
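Histograms give a visual check for skew. As a numeric companion, the sketch below computes a simple moment-based skewness before and after a log transformation. The `pop_toy` values are hypothetical, not the actual state populations, and `skewness()` is a helper defined here rather than a base R function.

```r
# Hypothetical right-skewed values (in millions), not the actual state data.
pop_toy <- c(0.6, 0.7, 1, 1.4, 2, 3, 5, 8, 20, 39)

# Moment-based skewness; values near 0 suggest a roughly symmetric distribution.
skewness <- function(x) {
  dev <- x - mean(x)
  mean(dev^3) / mean(dev^2)^1.5
}

skew_raw <- skewness(pop_toy)
skew_log <- skewness(log(pop_toy))
print(c(raw = skew_raw, logged = skew_log))
```

A large positive value for the raw data that shrinks toward zero after logging is one sign that a log scale may make the map easier to read.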

Filtering a Map to a Region

The map techniques above can also be used for a region.
The demo that follows uses an education dataset with data filtered to 10 Northeastern states.

Code

```{r import modify filter education data}
edu <- read_csv("data/education by state.csv", skip=3, show_col_types = F, # import data
                col_names = c("state", "pop_over_25", "pop_hs", "pct_hs",
                              "pop_bachelor", "pct_bachelor", 
                              "pop_advanced","pct_advanced")) 
edu1 <- edu |>
  select(state, pop_bachelor, pct_bachelor) |>
  mutate(state = str_trim(state) |> tolower(),
         pop_bachelor1K = pop_bachelor/1000,
         pct_bachelor = gsub("%","", pct_bachelor, fixed = T) |> as.numeric()) |> 
  filter(state %in% c("maine", "massachusetts", "connecticut" , "rhode island",
                      "vermont", "new hampshire", "new york", "new jersey", "pennsylvania",
                      "delaware")) |> glimpse()
```

Rows: 10
Columns: 4
$ state          <chr> "vermont", "rhode island", "pennsylvania", "new york", …
$ pop_bachelor   <dbl> 172272, 260275, 2917402, 5166218, 2551765, 368237, 2181…
$ pct_bachelor   <dbl> 38.66, 34.84, 32.31, 37.81, 41.22, 37.58, 44.98, 33.19,…
$ pop_bachelor1K <dbl> 172.272, 260.275, 2917.402, 5166.218, 2551.765, 368.237…
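The chunk above removes the `%` with `gsub()` and then converts with `as.numeric()`. As an alternative sketch, `parse_number()` from readr does both steps at once. The strings below are hypothetical examples in the style of the education and box office files.

```r
library(readr)

# Hypothetical raw strings in the style of the education and box office files.
pct_raw   <- c("38.66%", "34.84%")
gross_raw <- c("$19,000,000", "$4,250")

# parse_number() drops non-numeric characters before and after the number
# and treats "," as a grouping mark by default.
pct_num   <- parse_number(pct_raw)
gross_num <- parse_number(gross_raw)
print(pct_num)
print(gross_num)
```

Either approach works; `gsub()` makes each cleaning step explicit, while `parse_number()` is shorter when the only clutter is symbols around the number.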

Exploratory Bachelor Degree Data Plots

Week 12 In-class Exercises - Q4-Q5

Session ID: bua455f24

Question 4. What exploratory plot command (base R code shown) is good for checking if the variable you want to plot is right skewed and might need to be log transformed?

Question 5. Based on the histogram for the northeastern area of the U.S., which includes only 10 states, do these data appear skewed?

Add Education Data to Map Data

In the chunk below we start from scratch with the state map and abbreviation data, so that part does not depend on state data imported and managed in a previous chunk. The only object reused from earlier is edu1.

Code

```{r join edu data with state map and state abbr data}
us_states <- map_data("state") |>    # state polygons (from R)
  select(long:region) |> rename("state" = "region")
state_abbr <- state_stats |>         # state abbreviations 
  select(state, abbr) |> mutate(state = tolower(state))

edu1 <- left_join(edu1, state_abbr)      # left join to maintain filter to NE states
```

Joining with `by = join_by(state)`

Code

```{r join edu data with state map and state abbr data}
edu_NE_map <- left_join(edu1, us_states) # left join to maintain filter to NE states
```

Joining with `by = join_by(state)`

Code

```{r join edu data with state map and state abbr data}
state_coords <- read_csv("data/state_coords.csv", show_col_types = F,       # add in state midpoints (centroids)
                         col_names = c("state", "m_lat", "m_long")) |>
  mutate(state = gsub(", USA", "", state, fixed=T),
         state = gsub(", the USA", "", state, fixed=T),
         state = gsub(", the US", "", state, fixed=T),
         state = tolower(state))
edu_NE_map <- left_join(edu_NE_map, state_coords)  # left join to maintain filter to NE states
```

Joining with `by = join_by(state)`

Code for Regional Map 1

Population with Bachelor’s Degree
Data Source - Wikipedia

Code

```{r NE edu map pop}
ne_edu_pop <- edu_NE_map |>         
  ggplot(aes(x=long, y=lat, group=group, fill=pop_bachelor1K)) +   # pop in 1000s
  geom_polygon(color="darkgrey") +
  theme_map() +
  coord_map("albers", lat0 = 39, lat1 = 45) +
  scale_fill_continuous(type = "viridis", trans="log",             # log transformation
                        breaks = c(100, 500, 1000, 5000)) +
  geom_shadowtext(aes(x=m_long, y=m_lat, label=abbr),
                  color="white", check_overlap = T, show.legend = F, size=4) + 
  labs(fill= "Unit: 1000 People", 
       title="NE States: Pop. with a Bachelor's Degree") +
  theme(legend.position = "bottom",
        legend.key.width = unit(1, "cm"),
        plot.title = element_text(size = 20),
        plot.subtitle = element_text(size = 15),
        plot.caption = element_text(size = 15),
        legend.text = element_text(size = 15),
        legend.title = element_text(size = 15))
```

Code for Regional Map 2

Percentage of People with Bachelor’s Degree
Data Source - Wikipedia

Code

```{r NE edu map pct}
ne_edu_pct <- edu_NE_map |>
  ggplot(aes(x=long, y=lat, group=group, fill=pct_bachelor)) +        # percent data
  geom_polygon(color="darkgrey") +
  theme_map() +
  coord_map("albers", lat0 = 39, lat1 = 45) +
  scale_fill_continuous(type = "viridis",                             # no transformation needed
                        breaks = c(32, 34, 36, 38, 40, 42, 44)) +
  geom_shadowtext(aes(x=m_long, y=m_lat, label=abbr),
                  color="white", check_overlap = T, show.legend = F, size=4) + 
  labs(fill= "Unit: %", title="NE States: Percent with a Bachelor's Degree") +
  theme(legend.position = "bottom",
        legend.key.width = unit(1, "cm"),
        plot.title = element_text(size = 20),
        plot.subtitle = element_text(size = 15),
        plot.caption = element_text(size = 15),
        legend.text = element_text(size = 15),
        legend.title = element_text(size = 15))
```

Pop. and Pcnt. Plots Side by Side
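The code behind this side-by-side display is not shown here. One common way to place two ggplot objects next to each other is the gridExtra package, sketched below with two placeholder plots standing in for `ne_edu_pop` and `ne_edu_pct`. The toy data and plot names are assumptions for illustration.

```r
library(ggplot2)
library(gridExtra)

# Two placeholder plots standing in for ne_edu_pop and ne_edu_pct.
toy <- data.frame(x = 1:5, y = c(2, 4, 3, 6, 5))
p_left  <- ggplot(toy, aes(x = x, y = y)) + geom_point() + labs(title = "Left")
p_right <- ggplot(toy, aes(x = x, y = y)) + geom_line()  + labs(title = "Right")

# arrangeGrob() builds the layout; ncol = 2 puts the plots in one row.
side_by_side <- arrangeGrob(p_left, p_right, ncol = 2)

# Draw the combined layout on the current graphics device.
grid::grid.newpage()
grid::grid.draw(side_by_side)
```

With the real maps, the same call would take the two saved plot objects in place of `p_left` and `p_right`.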

Managing Projects

Some of this should be review.
Next week, we will talk about managing a long-term consulting project:
- Managing files over time
- Segmenting and rejoining poorly formatted data
- Documenting steps as you progress
- Addressing client needs as they evolve and as clients update their requests
Documentation is key.
- Take good notes and keep the README file updated.
I use Markdown or Quarto files for everything, even work I don’t present to a client.
- Ideal format for writing notes between code chunks

BUA 455 Project File Conventions

Main Project Folder:
- Dashboard .qmd file (Quarto file)
- Dashboard .html file (Dashboard presentation)
- Project .rproj file that makes folder into an R project.
- README.txt file that includes an organized list of all files you created or saved.
- Other files created when the .qmd file is rendered.
  - These ‘byproduct’ files do not need to be listed in the README.
Data (data) Folder:
- All raw .csv files needed (No data management should be done in Excel!)
Images (img) Folder:
- Any .png or other graphics files needed
OPTIONAL: Extraneous useful code can be saved in a separate folder within the project.
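The conventions above can be checked with a few lines of base R. The sketch below builds a throwaway project folder in a temporary directory, reports which expected items are missing, and lists the files as a starting point for the README. The folder name `bua455_project_demo` and the file name `dashboard.qmd` are hypothetical.

```r
# Build a throwaway project folder so the sketch runs anywhere.
proj <- file.path(tempdir(), "bua455_project_demo")
dir.create(file.path(proj, "data"), recursive = TRUE, showWarnings = FALSE)
file.create(file.path(proj, "README.txt"))
file.create(file.path(proj, "dashboard.qmd"))   # hypothetical file name

# Items the conventions above call for (img is left out on purpose).
expected <- c("data", "img", "README.txt")
present  <- file.exists(file.path(proj, expected))
missing_items <- expected[!present]
print(missing_items)

# A starting point for the README: every file currently in the project.
print(list.files(proj, recursive = TRUE))
```

In a real project, `proj` would simply be the project folder, for example `"."` when working inside the R project.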

RPubs Exercise

RPubs (mentioned earlier in this set of slides)
If you want to publish your dashboard or any HTML file you create in R, you can do so for free.
R has a public online repository called RPubs.
RPubs is very useful if you want to post an HTML file online and provide the link to it.
I use RPubs for slides in this course, and it is useful for work like the project dashboards.
As an in-class exercise, I will ask each of you to create an account and publish your HW 5 - Part 1 dashboard HTML file.
- This exercise will be useful because it allows you to see how this publication process works.
  - You will see how publishing changes the appearance of your panels and text.
  - Once you post your final dashboard you may want to include it as a link in your resume and/or LinkedIn profile.

In-class Exercise

Open your HW 5 - Part 1.Rmd file and knit it to create your dashboard.

Make sure this file has your name in the header.
If you don’t have HW 5 - Part 1 done, you can use the Posit Cloud version of HW 5 - Part 1 provided for HW 5 - Part 2

Click the Publish icon, create a free account, and publish your HTML file.

If RStudio asks to install additional packages to complete the publishing process, click Yes.

Submit the link to your published file on Blackboard.

A link to your published file must be submitted by Friday 4/11 at midnight to count for class participation for today’s lecture.

Next Week - Additional Topics

Ask me questions about your project (Others may benefit)
I have some short essential and some optional topics including:
- details and recommendations for writing both project memos.
  - Memos will be written as Word documents in Quarto (.qmd).
- managing a consulting project from beginning to end.
- formatting complex tables using the gt package.
- knitting Quarto files to different formats: Word, PowerPoint, etc.

Additional Topics Continued

Review of Skillset Terminology

Now that you are (almost) done with BUA 455, and more so when you graduate, you have a very useful set of skills.

Explaining these skills to others is a challenge.
I will spend a little time talking about how to explain those skills to other people.
Preview: It took me decades to figure out how to talk about what I do, in part because this discipline used to be more obscure.
- Increased interest in Data Science and Analytics has resulted in better terminology.
- A white paper from DataCamp provides an excellent blueprint.

Key Points from This Week

More with Geographic Data

Adding Shadow Text
Filtering Map Data and Comparing Variables

Project Management

Review of skills covered throughout course
Managing data projects this way is beneficial

Publishing Work on RPubs

Useful for publishing and linking to work

You may submit an ‘Engagement Question’ about each lecture until midnight on the day of the lecture. A minimum of four submissions is required during the semester.

--- title: "Week 12" subtitle: "More on Geographic Data, Project Management, and Publishing a Dashboard" author: "Penelope Pooler Eisenbies" date: last-modified lightbox: true toc: true toc-depth: 3 toc-location: left toc-title: "Table of Contents" toc-expand: 1 format: html: code-line-numbers: true code-fold: true code-tools: true execute: echo: fenced --- ## Housekeeping ```{r include=F} #|label: setup knitr::opts_chunk$set(echo=T, highlight=T) # specifies default options for all chunks options(scipen=100) # suppress scientific notation # install pacman if needed if (!require("pacman")) install.packages("pacman", repos = "http://lib.stat.cmu.edu/R/CRAN/") pacman::p_load(pacman, tidyverse, ggthemes, gridExtra, magrittr, kableExtra, RColorBrewer, maps, usdata, countrycode, mapproj, shadowtext, grid) # install and load required packages p_loaded() # verify loaded packages ``` ## Housekeeping - Final Proposals and Group Projects - Your participation in the final proposal and plan will affect your final grade on the project. - Group members that don't contribute will not get credit for the work done by others in the group. - If you have data management questions, reach out to myself or a course TA. - We are here to help with tasks where you might be stymied, but don't wait until the last day. - Presentations will be on 4/28. - Only presentation dashboards are due on 4/28. - All students are required to attend and provide feedback. ## More Housekeeping - HW 5 - Part 2 is now posted and I am updating the demo videos because this assignment will be done on Posit Cloud. - There is a 2 day grace period, if needed. - This a short assignment that covers some final essential skills for your dashboard project. - Quiz 2 grading is almost done. There are a couple students taking a make-up this evening. ## Plans for this week - Two lectures on Geographic Data have been streamlined so that students also have time for group work this week. 
- I will not cover them all in detail - Rather than deleting notes and code that might be useful to some students all notes are provided. ::: fragment **Topics Covered** ::: - Geographic Data: world data, state data, and filtering map data to a region - Publishing work: More tips for good project management - Posting HTML files for free using [**Rpubs**](https://rpubs.com/) - Note: [**Rpubs**](https://rpubs.com/) is the recommended option for presenting and submitting your dashboard - HW 5 - Part 2 Demo on Posit Cloud - **There will be time to work on projects in class this week and next week.** ## Importing and Joining World Datasets **World Data** ```{r world data prep} world <- map_data("world") |> select(!subregion) |> # world geo info mutate(region=ifelse(region=="UK", "United Kingdom", region)) intbxo <- read_csv("data/intl_bxo.csv", show_col_types = F, skip=7) |> # import/tidy bxo select(1,6) |> rename("region" = "Area", "wknd_gross" = "Weekend Gross") |> filter(!is.na(wknd_gross)) |> mutate(wknd_gross = gsub("$", "", wknd_gross, fixed = T), wknd_gross = gsub(",", "", wknd_gross, fixed = T) |> as.numeric()) world_bxo_data <- full_join(intbxo, world) |> # join datasets filter(!is.na(wknd_gross)) world_bxo_data$continent = countrycode(sourcevar = world_bxo_data$region, # retrieve continents origin = "country.name", destination = "continent") head(world_bxo_data, 3) ``` ## Choropleth Country Plot w/ Labels ::: fragment **Example - Asia** ::: - **Most** of the plot code that follows is review - There are a few new details: - `shadowtext` labels (see below) - modifying size of text elements (mentioned but not emphasized) - **NOTES:** - The R package `shadowtext` includes the command `geom_shadowtext` - `shadowtext` is useful for creating visible labels for all countries regardless of map fill color - Deciding on units (\$1000) and transformation (`log`) took some trial and error. 
## Managing Data for Asia Chropleth Map **This R code creates the Asia Map dataset.** ```{r asia data for map} asia_bxo_data <- world_bxo_data |> # create asia box office dataset filter(continent=="Asia") |> mutate(Gross = as.integer(wknd_gross), wknd_gross = wknd_gross/1000) asia_nms <- asia_bxo_data |> # create dataset of country names select(region, long, lat, group, continent) |> # median lat and long # used for label positions group_by(continent, region) |> summarize(nm_x=median(long, na.rm=T), nm_y=median(lat, na.rm=T)) |> filter(!is.na(nm_x) | !is.na(nm_y)) asia_bxo_data <- full_join(asia_bxo_data, asia_nms) # merge datasets using an inner_join ``` ## R code for Asia Choropleth Map Data are shown on log scale to improve interpretability. ```{r asia static map code} asia_bxo_map <- asia_bxo_data |> # Creates the map that follows ggplot(aes(x=long, y=lat, group=group, fill=wknd_gross)) + geom_polygon(color="darkgrey") + theme_map() + coord_map("albers", lat0 = 39, lat1 = 45) + labs(fill= "Gross ($1K)", title="Weekend Gross ($ Thousands) in Asian Countries", subtitle="Weekend Data Updated 4/7/25 - Data are Log-transformed", caption="Data Source: https://www.boxofficemojo.com") + scale_fill_continuous(type = "viridis", trans="log", breaks =c(1,10,100,1000,10000)) + geom_shadowtext(aes(x=nm_x, y=nm_y,label=region), color="white",check_overlap = T, show.legend = F, size=4) + theme(plot.title = element_text(size = 20), plot.subtitle = element_text(size = 15), plot.caption = element_text(size = 10), legend.text = element_text(size = 12), legend.title = element_text(size = 15), plot.background = element_rect(colour = "darkgrey", fill=NA, linewidth=2)) ``` ## ### Asia Map with Log (LN) Transformation ```{r fig.dim=c(15,7), echo=F, warning=F} asia_bxo_map ``` ## Europe Map Data Creates data for Europe Map ```{r europe data for map} euro_bxo_data <- world_bxo_data |> # create Europe box office dataset filter(continent=="Europe" & region != "Russia") |> mutate(Gross = 
as.integer(wknd_gross), wknd_gross = wknd_gross/1000) euro_nms <- euro_bxo_data |> # create dataset of country names select(region, long, lat, group, continent) |> # median lat and long used for position group_by(continent, region) |> summarize(nm_x=median(long, na.rm=T), nm_y=median(lat, na.rm=T)) |> filter(!is.na(nm_x) | !is.na(nm_y)) euro_bxo_data <- full_join(euro_bxo_data, euro_nms) # merge datasets using an inner_join ``` ## R code for Europe Choropleth Map Data are shown on log scale to improve interpretability. ```{r europe static map code} euro_bxo_map <- euro_bxo_data |> ggplot(aes(x=long, y=lat, group=group, fill=wknd_gross)) + geom_polygon(color="darkgrey") + theme_map() + coord_map("albers", lat0 = 39, lat1 = 45) + labs(fill= "Gross ($1K)", title="Weekend Gross ($ Thousands) in European Countries", subtitle="Weekend Ending 11/10/24 - Data are Log-transformed", caption="Data Source: https://www.boxofficemojo.com") + scale_fill_continuous(type = "viridis", trans="log", breaks =c(1,10,100,1000,10000)) + geom_shadowtext(aes(x=nm_x, y=nm_y,label=region), color="white",check_overlap = T, show.legend = F, size=4) + theme(plot.title = element_text(size = 20), plot.subtitle = element_text(size = 15), plot.caption = element_text(size = 10), legend.text = element_text(size = 12), legend.title = element_text(size = 15)) ``` ## ### Europe Map with Log (LN) Transformation ```{r fig.dim=c(15,7), echo=F, warning=F, fig.align='center', warning=F} euro_bxo_map ``` ## ### Week 12 In-class Exercises - Q1-Q3 ***Session ID: bua455f24*** **Question 1.** What option is used in `geom_polygon()` to create the outlines of each country? **Question 2.** How many different geometries (`geom_...`) are used to create these multi-layer maps? **Question 3.** When using multiple geometry layers, where do you place the aesthetic, (`aes`) so that it will apply to all of the geometries (all of the map layers)? 
## US State Data Example - Examples of Data that can be plotted by state - Average costs and expenditures by state of specific goods or services - Demographic data - Voting and tex information - Sports/Arts/Entertainment/Education investments and expenditures - Will also show a map of data filtered by region ## US State Map Data ```{r combine state polygons with state population data from R} us_states <- map_data("state") |> # state polygons (from R) select(long:region) |> rename("state" = "region") state_abbr <- state_stats |> # many useful variables in this dataset select(state, abbr) |> mutate(state = tolower(state)) state_pop <- county_2019 |> # data by county (aggregated by state) select(state, pop) |> mutate(state=tolower(state), popM = pop/1000000) |> group_by(state) |> summarize(st_popM = sum(popM, na.rm=T)) |> full_join(state_abbr) statepop_map <- left_join(us_states, state_pop) # used left join to filter to lower 48 states # lat/long not available for Hi and AK ``` ## ### Adding State Midpoint (centroid) Lat and Long - In the previous maps (by country) country labels were added to the static map using each polygon's (country) median latitude and longitude - Medians don't work well for U.S. because many states are oddly shaped and small. - Alternative: [use centroid for each state polygon](https://www.latlong.net/category/states-236-14.html) - Centroid is another term for midpoint - Saved data as .csv file named `state_coords.csv` (included) - Data did not include D.C. but those coordinates were found elsewhere - D.C. data is appended to other states using `bind_rows` - `state_coords` (centroids) were joined with state demographics data, `statepop_map`. 
- Final dataset for plot created: `statepop_map`

## Code for Adding Centroids to Data

```{r add lat and long of state midpoints (centroid)}
state_coords <- read_csv("data/state_coords.csv", show_col_types = F,
                         col_names = c("state", "m_lat", "m_long")) |>
  mutate(state = gsub(", USA", "", state, fixed=T),
         state = gsub(", the USA", "", state, fixed=T),
         state = gsub(", the US", "", state, fixed=T),
         state = tolower(state))

state <- "district of columbia"                # save values for dc
m_lat <- 38.9072
m_long <- -77.0369
dc <- tibble(state, m_lat, m_long)             # create dataset of dc data (1 obs)

state_coords <- bind_rows(state_coords, dc)    # add dc to state_coords
rm(dc, state, m_lat, m_long)                   # remove temporary values from global

statepop_map <- left_join(statepop_map, state_coords)  # centroids to data
```

## State Population Plot

- Similar to previous plots with a few changes
  - Added borders to states by adding `color="darkgrey"` to `geom_polygon` command.
  - Used State abbreviations for state labels.
  - Made State text labels smaller (Size = 2)
  - Changed breaks for log scaled population legend
- These details seem minor but they take time and trial and error.

##

### R Code for US State Pop. Map (no transformation)

```{r code for us states pop map no transformation}
st_pop <- statepop_map |>
  ggplot(aes(x=long, y=lat, group=group, fill=st_popM)) +
  geom_polygon(color="darkgrey") +
  theme_map() +
  coord_map("albers", lat0 = 39, lat1 = 45) +
  scale_fill_continuous(type = "viridis") +
  geom_shadowtext(aes(x=m_long, y=m_lat, label=abbr),
                  color="white", check_overlap = T, show.legend = F, size=4) +
  labs(fill= "Pop. in Millions",
       title="Population by State",
       subtitle="Unit is 1 Million People",
       caption= "Not Shown: HI: 1.42 Million AK: 0.74 Million Data Source: https://CRAN.R-project.org/package=usdata") +
  theme(legend.position = "bottom",
        legend.key.width = unit(1, "cm"),
        plot.title = element_text(size = 20),
        plot.subtitle = element_text(size = 15),
        plot.caption = element_text(size = 15),
        legend.text = element_text(size = 15),
        legend.title = element_text(size = 15))
```

##

### US State Pop. Map (no transformation)

```{r us states pop map no transformation, echo=F, fig.dim=c(15,7), fig.align='center'}
st_pop
```

##

### R Code for US State Pop. Map (log transformed)

```{r code for us states pop map with log transformation}
st_lpop <- statepop_map |>
  ggplot(aes(x=long, y=lat, group=group, fill=st_popM)) +
  geom_polygon(color="darkgrey") +
  theme_map() +
  coord_map("albers", lat0 = 39, lat1 = 45) +
  scale_fill_continuous(type = "viridis", trans="log",
                        breaks=c(0,1,2,3,5,10,20,35)) +
  geom_shadowtext(aes(x=m_long, y=m_lat, label=abbr),
                  color="white", check_overlap = T, show.legend = F, size=4) +
  labs(fill= "Pop. in Millions",
       title="Population by State",
       subtitle="Unit is 1 Million People - Log Transformed",
       caption= "Not Shown: HI: 1.42 Million AK: 0.74 Million Data Source: https://CRAN.R-project.org/package=usdata") +
  theme(legend.position = "bottom",
        legend.key.width = unit(1, "cm"),
        plot.title = element_text(size = 20),
        plot.subtitle = element_text(size = 15),
        plot.caption = element_text(size = 15),
        legend.text = element_text(size = 15),
        legend.title = element_text(size = 15))
```

##

### US State Pop. Map (log transformed)

```{r us states pop map log transformation, echo=F, fig.dim=c(15,7), fig.align='center'}
st_lpop
```

## To log or not to log

:::::: columns
::: {.column width="48%"}
In this course, we visualize data using `ggplot`, `hchart`, and `dygraph`.

If you want to explore (but not present) data, you can also use base graphics for quick plots.

- Base graphics could also be used to make polished visualizations but the code is much longer and more tedious than `ggplot`
:::

::: {.column width="4%"}
:::

::: {.column width="48%"}
```{r base graphics plots, fig.dim=c(5, 6), fig.align='center', out.extra='style="background-color: #3D3D3D; padding:1px;"'}
par(mfrow=c(2,1))                           # stacks base graph plots
hist(statepop_map$st_popM, main="")
hist(log(statepop_map$st_popM), main="")
par(mfrow=c(1,1))                           # resets base graph options
```
:::
::::::

## Filtering a Map to a Region

- Map techniques above can also be used for a region
- Demo that follows uses an education dataset with data filtered to 10 Northeastern states

::: fragment
```{r import modify filter education data}
edu <- read_csv("data/education by state.csv", skip=3, show_col_types = F,   # import data
                col_names = c("state", "pop_over_25", "pop_hs", "pct_hs", "pop_bachelor",
                              "pct_bachelor", "pop_advanced","pct_advanced"))

edu1 <- edu |>
  select(state, pop_bachelor, pct_bachelor) |>
  mutate(state = str_trim(state) |> tolower(),
         pop_bachelor1K = pop_bachelor/1000,
         pct_bachelor = gsub("%","", pct_bachelor, fixed = T) |> as.numeric()) |>
  filter(state %in% c("maine", "massachusetts", "connecticut" , "rhode island",
                      "vermont", "new hampshire", "new york", "new jersey",
                      "pennsylvania", "delaware")) |>
  glimpse()
```
:::

## Exploratory Bachelor Degree Data Plots

<center>

```{r base R scatterplot and histogram, out.extra='style="background-color: #3D3D3D; padding:1px;"', fig.dim=c(12,7), echo=FALSE}
par(mfrow=c(1,2))
hist(edu1$pop_bachelor1K, main="")
plot(edu1$pop_bachelor1K, edu1$pct_bachelor, main="")
par(mfrow=c(1,1))
```
</center>

##

### Week 12 In-class Exercises - Q4-Q5

***Session ID: bua455f24***

**Question 4.** What exploratory plot command (base R code shown) is good for checking if the variable you want to plot is right skewed and might need to be log transformed?

**Question 5.** Based on the histogram for the northeastern area of the U.S., which includes only 10 states, do these data appear skewed?

## Add Education Data to Map Data

In the chunk below we start from scratch with state data.

This chunk does not depend on the state map data imported and managed in a previous chunk; only the filtered education data, `edu1`, is carried over.

```{r join edu data with state map and state abbr data}
us_states <- map_data("state") |>            # state polygons (from R)
  select(long:region) |>
  rename("state" = "region")

state_abbr <- state_stats |>                 # state abbreviations
  select(state, abbr) |>
  mutate(state = tolower(state))

edu1 <- left_join(edu1, state_abbr)          # left join to maintain filter to NE states

edu_NE_map <- left_join(edu1, us_states)     # left join to maintain filter to NE states

state_coords <- read_csv("data/state_coords.csv", show_col_types = F,   # add in state midpoints (centroids)
                         col_names = c("state", "m_lat", "m_long")) |>
  mutate(state = gsub(", USA", "", state, fixed=T),
         state = gsub(", the USA", "", state, fixed=T),
         state = gsub(", the US", "", state, fixed=T),
         state = tolower(state))

edu_NE_map <- left_join(edu_NE_map, state_coords)  # left join to maintain filter to NE states
```

## Code for Regional Map 1

**Population with Bachelor's Degree**\
[Data Source - Wikipedia](https://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_educational_attainment)

```{r NE edu map pop}
ne_edu_pop <- edu_NE_map |>
  ggplot(aes(x=long, y=lat, group=group, fill=pop_bachelor1K)) +   # pop in 1000s
  geom_polygon(color="darkgrey") +
  theme_map() +
  coord_map("albers", lat0 = 39, lat1 = 45) +
  scale_fill_continuous(type = "viridis", trans="log",             # log transformation
                        breaks = c(100, 500, 1000, 5000)) +
  geom_shadowtext(aes(x=m_long, y=m_lat, label=abbr),
                  color="white", check_overlap = T, show.legend = F, size=4) +
  labs(fill= "Unit: 1000 People",
       title="NE States: Pop. with a Bachelor's Degree") +
  theme(legend.position = "bottom",
        legend.key.width = unit(1, "cm"),
        plot.title = element_text(size = 20),
        plot.subtitle = element_text(size = 15),
        plot.caption = element_text(size = 15),
        legend.text = element_text(size = 15),
        legend.title = element_text(size = 15))
```

## Code for Regional Map 2

**Percentage of People with Bachelor's Degree**\
[Data Source - Wikipedia](https://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_educational_attainment)

```{r NE edu map pct}
ne_edu_pct <- edu_NE_map |>
  ggplot(aes(x=long, y=lat, group=group, fill=pct_bachelor)) +     # percent data
  geom_polygon(color="darkgrey") +
  theme_map() +
  coord_map("albers", lat0 = 39, lat1 = 45) +
  scale_fill_continuous(type = "viridis",                          # no transformation needed
                        breaks = c(32, 34, 36, 38, 40, 42, 44)) +
  geom_shadowtext(aes(x=m_long, y=m_lat, label=abbr),
                  color="white", check_overlap = T, show.legend = F, size=4) +
  labs(fill= "Unit: %",
       title="NE States: Percent with a Bachelor's Degree") +
  theme(legend.position = "bottom",
        legend.key.width = unit(1, "cm"),
        plot.title = element_text(size = 20),
        plot.subtitle = element_text(size = 15),
        plot.caption = element_text(size = 15),
        legend.text = element_text(size = 15),
        legend.title = element_text(size = 15))
```

## Pop. and Pcnt. Plots Side by Side

```{r display of NE pop and pct maps, fig.dim=c(15,7), fig.align='center', echo=FALSE}
grid.arrange(ne_edu_pop, ne_edu_pct, ncol=2)
grid.rect(width = .98, height = .98, gp = gpar(lwd = 2, col = "darkgrey", fill = NA))
```

## Managing Projects

- Some of this should be review
- Next week, we will talk about managing a long term consulting project
  - Managing files over time
  - Segmenting and rejoining poorly formatted data
  - Documenting steps as you progress
  - Addressing client needs as they evolve and requests are updated
- Documentation is key
  - Take good notes and keep the README file updated
  - I use Markdown or Quarto files for everything, even work I don't present to the client.
  - Ideal format for writing notes between code chunks

## BUA 455 Project File Conventions

- **Main Project Folder:**
  - Dashboard `.qmd` file (Quarto file)
  - Dashboard `.html` file (Dashboard presentation)
  - Project `.Rproj` file that makes the folder into an R project.
  - `README.txt` file that includes an organized list of all files you created or saved.
  - Other files created when the `.qmd` file is rendered.
    - These 'byproduct' files do not need to be listed in the README.
- **Data (`data`) Folder:**
  - All raw .csv files needed (No data management should be done in Excel!)
- **Images (`img`) Folder:**
  - Any .png or other graphics files needed
- **OPTIONAL:** Extraneous **useful** code can be saved in a separate folder within the project.

## RPubs Exercise

- **RPubs** (mentioned earlier in this set of slides)
  - If you want to publish your dashboard or any HTML file you create in R, you can do so for free.
  - R has a public online repository called [**RPubs**](https://rpubs.com/).
  - **RPubs** is very useful if you want to post an html file online and provide the link to it.
  - I use **RPubs** for slides in this course and it is useful for work like the project dashboards.
- As an in-class exercise, I will ask you each to create an account and publish your HW 5 - Part 1 dashboard html file.
- This exercise will be useful because it allows you to see how this publication process works.
- You will see how publishing changes the appearance of your panels and text.
- Once you post your final dashboard you may want to include it as a link in your resume and/or LinkedIn profile.

## In-class Exercise

1. Open your HW 5 - Part 1.Rmd file and knit it to create your dashboard.
   - Make sure this file has your name in the header.
   - If you don't have HW 5 - Part 1 done, you can use the [Posit Cloud version of `HW 5 - Part 1` provided for HW 5 - Part 2](https://posit.cloud/content/10125821){target="_blank"}
2. Click the **Publish Icon** ![](img/publish_icon.png), create a free account, and publish your html file.
   - If RStudio asks to install additional packages to complete the publishing process, click `Yes`.
3. Submit the link to your published file on Blackboard.
   - A link to your published file must be submitted by Friday 4/11 at midnight to count for class participation for today's lecture.

## Next Week - Additional Topics

- Ask me questions about your project (Others may benefit)
- I have some short essential and some optional topics including:
  - details and recommendations for writing both project memos.
    - Memos will be written as Word documents in Quarto (`.qmd`).
  - managing a consulting project from beginning to end.
  - formatting complex tables using the `gt` package.
  - knitting Quarto files to different formats: Word, PowerPoint, etc.

## Additional Topics Continued

**Review of Skillset Terminology**

::: fragment
Now that you are (almost) done with BUA 455, and more so when you graduate, you have a very useful set of skills.
:::

- Explaining these skills to others is a challenge.
- I will spend a little time talking about how to explain those skills to other people
- Preview: It took me decades to figure out how to talk about what I do, in part because this discipline was more obscure.
- Increased interest in Data Science and Analytics has resulted in better terminology.
- [White Paper from DataCamp provides an excellent blueprint](https://drive.google.com/file/d/1_VoM3D6tPftjZpXCnTL8SKYBlOM_4KjG/view?usp=sharing)

##

### Key Points from This Week

::: fragment
**More with Geographic Data**
:::

- Adding Shadow Text
- Filtering Map Data and Comparing Variables

::: fragment
**Project Management**
:::

- Review of skills covered throughout course
- Managing data projects this way is beneficial

::: fragment
**Publishing Work on RPubs**
:::

- Useful for publishing and linking to work

::: fragment
You may submit an 'Engagement Question' about each lecture until midnight on the day of the lecture.

**A minimum of four submissions are required during the semester.**
:::