Code
Getting started with GGPLOT
This time I’m using the robobook markdown theme, in case you want to copy it. (Download the code to get the stuff you put at the top.) The libraries used are:
knitr tidyverse janitor scales
rmdformats lubridate forcats
The data
This data was created from an early version of the Paycheck Protection Program given in Arizona during the pandemic. Here are the column definitions
zcta
char
Zip code used in the census
zipcode_city
char
Census name associated with that zip code
st_county_fip
char
FIP code for the county
ethnicity
factor
County demographic: “No people”, “White”, “No distinct group”, “Black”, “Hispanic”, “Native American” based on a formula used by the Urban institute to categorize counties.
lender_grp
factor
The top 100 lenders are named and all others lumped into “Other”; the biggest one will come first.
loans
num
# of loans by zip code and lender group
amount
num
$ amount of the loans, according to the initial amount reported, by zip code and lender group
usps_businesses
num
total number of business addresses in the zip code
A factor works like words, but is always shown in the order you specify, so you don’t have to worry about it alphabetizing everything by default. It’s particularly useful for charting.
A searchable, sortable table
Here’s what the first 20 rows look like:
Our first chart
One of the first things a lot of people do with a large dataset is to create a histogram of the data. One example is the number of loans by zip code.
Create a histogram
Alberto Cairo talked about the elements of a data visualization: The container, the encodings and the annotations. The first thing to do is to create the container. But first, we need to get the data that we want into the form needed for a visualization. We aggregate the lenders to zip code totals, and compute the percentage of businesses that got loans. I’ll cap it at 500 % because there were some places with more loans than there were businesses - a result of not having a good list of the eligible businesses for the program.
pppdata <-
ppp %>%
group_by ( zcta, ethnicity) %>%
summarise ( loans = sum (loans) , businesses = first (usps_businesses)) %>%
mutate ( pct_loans = pmin (loans / businesses, 5 ))
my_histogram <-
ggplot ( data= pppdata,
mapping = aes ( x = pct_loans ))
produces:
Nothing! That’s because we haven’t yet told R what geometry to use. The choice here is geom_histogram()
Add an encoding
my_histogram +
geom_histogram ( )
Fix bins and colors
This is hard to read, not just because I haven’t formatted the percentages (so 5 = 500% of all businesses got loans), but because of the default number of “bins” chosen. Here’s how to clean that up a little. At the same time, I’ll apply a theme that’s a little easier to read, and set the outline (color) and the fill so that the bars aren’t smushed together.
The “Colors of R ” document from Columbia University shows you the names of the colors you can use.
my_histogram +
geom_histogram ( binwidth= .1 , color= "white" , fill= "slategrey" ) +
theme_light ()
All I did was make the bins 10% of each zip code’s businesses (0.1)
That’s ok, but let’s make the labels more accurate:
my_histogram +
geom_histogram ( binwidth= .1 , color= "white" , fill= "slategrey" ) +
theme_light () +
scale_x_continuous ( labels= percent) +
labs (x= "% of busineses that got loans (maxed at 500%)" , y = "# of zip codes" )
Adding complexity
Say we want to see a histogram by the ethnicity of the zip code. We might choose to see a different plot for each one. When we walk through it, I’ll go through each of these items one by one. I’ve started with a blank plot with data for each zip code, showing the number of loans and the ethnicity of each zip code.
myplot <-
pppdata %>%
filter ( ethnicity %in% c ("White" , "Hispanic" , "Native American" )) %>%
ggplot ( )
Building this a little at a time, we can save the plot and keep working on it later. The only thing changed was the “fill” color – an encoding - to be set by ethnicity of the zip code.
myplot <-
myplot +
( aes ( x= pct_loans , fill= ethnicity)) +
geom_histogram ( binwidth= .1 ,
color= "gray" ) +
scale_x_continuous ( labels= percent)
myplot
That’s not very helpful. Instead, let’s try it by putting each ethnicity on its own scale and graphic, using facets. facet_grid creates a grid of graphis, with the variable on the left defining the rows and the variable on the right of the tilde as the variable for columns. Use a period to leave either one blank.
myplot +
facet_grid ( ethnicity ~ . )
This still isn’t that helpful - the scales are the same, making it hard to see the shape. This time, we’ll save it and then print it out, meaning we don’t have to type anything over again:
myplot <-
myplot +
facet_grid ( ethnicity ~ . , scales= "free_y" )
myplot
Finally, let’s get rid of the legend and use some other colors :
myplot +
theme_light () +
theme ( legend.position= "none" ) +
scale_fill_hue ()
A bar chart
Remember that people can see the difference between lenghts of lines better than angles or sizes of circles. That’s why you see so many bar charts. Let’s take the top 50 lenders and compare the number of loans each has made.
lender_totals <-
ppp %>%
group_by ( lender_grp, ethnicity ) %>%
summarise ( loans = sum (loans)) %>%
filter ( as.integer ( lender_grp) <= 50 )
This creates a sideways bar chart, which is easier to read with a lot of categories (lenders) and long names:
lender_totals %>%
group_by ( lender_grp) %>%
summarise (loans = sum (loans)) %>%
ggplot ( aes ( x= lender_grp, y= loans)) +
geom_bar ( position= position_dodge (), stat= "identity" ) +
guides (fill = FALSE ) +
labs (x = NULL ) +
coord_flip ()
One useful thing is to reverse the order of the graphic so the biggest are on top.
myplot <-
lender_totals %>%
group_by ( lender_grp) %>%
summarise (loans = sum (loans)) %>%
ggplot ( aes ( x= reorder ( lender_grp, loans) , y= loans)) +
geom_bar ( position= position_dodge (), stat= "identity" ) +
guides (fill = "none" ) +
labs (x = "Company" ) +
coord_flip () +
theme_bw ()
myplot
Dot plots
Percentages by ethnicity
We’d like to compare the percent of loans within each lender for Hispanic, White and Native American zip codes. This is when we really can use the power of the facets to create a grid tiny little graphics – one for each lender!
Set up the data
Our goal is to make a series of little dot plots, sometimes called “Cleveland” plots, after the statistical expert that invented the small multiple faceted approach.
To see a big version, let’s make a Cleveland dot plot of the above bar chart instead. This time, the name of the bank is the Y axis and the number of loans is the x axis.
myplot <-
lender_totals %>%
group_by ( lender_grp) %>%
summarise (loans = sum (loans)) %>%
ggplot ( aes ( y= reorder ( lender_grp, loans) , x= loans)) +
geom_point (size= 4 , color= "black" ) +
theme_bw () +
theme ( panel.grid.major.x = element_blank (),
panel.grid.minor.x = element_blank (),
panel.grid.major.y = element_line (color = "gray60" , linetype= "dotted" )) +
labs ( y= NULL ) +
scale_x_continuous ( label= scales:: comma)
myplot
Facets for ethnicity
That was easy enough, and is even a little easier to read than the bars. Now let’s make three of them – one for each ethnicity. This time, we’ll just take the same thing, but start with the data that includes ethnicity
ppp %>%
group_by ( lender_grp, ethnicity) %>%
filter ( as.integer (lender_grp) <= 50 , ethnicity %in% c ("White" , "Native American" , "Hispanic" )) %>%
summarise (loans = sum (loans), .groups= "drop" ) %>%
ggplot ( aes ( y= reorder ( lender_grp, desc (lender_grp) ) , x= loans, color= ethnicity)) +
geom_point (size= 2 , color= "black" ) +
facet_grid ( . ~ ethnicity , scales= "free_x" ) +
theme_bw () +
scale_color_brewer () +
theme ( panel.grid.major.x = element_blank (),
panel.grid.minor.x = element_blank (),
panel.grid.major.y = element_line (color = "gray60" , linetype= "dotted" )) +
labs ( y= NULL ) +
scale_x_continuous ( label= scales:: comma)
But this isn’t very useful, since it is the raw number of loans for each. Instead, we’ll convert the data so that, within each ethnicity, we show the market share (the percent of all loans) that went to each group.
totals <-
az_ppp_zip %>%
group_by ( census_zip) %>%
summarise ( loans = n (), lenders= n_distinct (lender), amount= sum (initial_amt)) %>%
right_join ( az_by_zipcode, by= c ("census_zip" = "zcta" ) ) %>%
select ( zcta= census_zip, zcta_ethnic, loans: amount, usps_businesses) %>%
mutate (ethnicity = ordered (zcta_ethnic),
loans = replace_na (loans, 0 )) %>%
group_by (ethnicity) %>%
summarise ( ethnic_loans = sum (loans) , ethnic_businesses= sum (usps_businesses, na.rm= T), .groups= "drop" ) %>%
mutate ( grand_total_loans = sum (ethnic_loans), grand_total_businesses = sum (ethnic_businesses),
ethnic_pct = ethnic_loans / grand_total_loans)
levels (totals$ ethnicity) <- c ("No people" , "White" , "No distinct group" , "Black" , "Hispanic" , "Native American" )
options (scipen= 999 )
loan_percentages <-
ppp %>%
group_by ( ethnicity, lender_grp ) %>%
summarise ( loans = sum (loans) , .groups= "drop" ) %>%
left_join ( totals, by= "ethnicity" ) %>%
group_by ( lender_grp ) %>%
mutate ( lender_pct = loans / sum (loans),
lender_ratio = lender_pct / ethnic_pct ) %>%
filter ( as.integer (lender_grp) <= 50 , ethnicity %in% c ("White" , "Native American" , "Hispanic" )) %>%
ungroup () %>%
select ( ethnicity: ethnic_loans, ethnic_pct, lender_pct, lender_ratio) %>%
mutate (capped_ratio = pmin (lender_ratio, 1.25 ),
capped_ratio = pmax (capped_ratio, .4 ))
Make facets by zip code ethnicity
loan_percentages %>%
ggplot ( aes ( y= reorder ( lender_grp, desc (lender_grp) ) , x= capped_ratio, color= ethnicity)) +
geom_point (size= 2 ) +
facet_grid ( . ~ ethnicity , scales= "free_x" ) +
theme_bw () +
theme ( panel.grid.major.x = element_blank (),
panel.grid.minor.x = element_blank (),
panel.grid.major.y = element_line (color = "gray60" , linetype= "dotted" ),
legend.position = "none" ) +
geom_vline (xintercept= 1 , color= "gray" , linetype= "solid" ) +
labs ( y= NULL , x= NULL ) +
scale_x_continuous ( label= scales:: comma) +
scale_color_discrete ()
Make a chart for each bank:
Here’s an example of making a little chart for each bank. The value is the relative percentage of each bank’s loans to zip codes by ethnicity, compared with the total for Arizona. Anything higher than 1 means that the bank is disprportionately lending to that group; anything lower than 1 means that they are not lending in those areas at the rate that others do. This is another easy way to pick out banks to focus on, such as Kabbage in Hispanic or Latino areas, and Wells Fargo in Native American areas.
myplot <-
loan_percentages %>%
ggplot ( aes ( y= reorder (ethnicity, desc (ethnicity)), x= capped_ratio, color= ethnicity)) +
geom_point (size= 2 ) +
facet_wrap ( ~ lender_grp, scales= "free_x" ) +
theme_light () +
theme ( panel.grid.major.x = element_blank (),
panel.grid.minor.x = element_blank (),
panel.grid.major.y = element_line (color = "gray60" , linetype= "dotted" ),
legend.position = "bottom" ,
axis.text.y = element_blank (),
axis.text.x = element_blank (),
strip.text.x = element_text (hjust= - .01 , size= 8 , color= "grey60" ),
strip.background = element_rect (fill= NA , color= NA , linetype = "dotted" )) +
labs ( y= NULL , x= NULL ) +
scale_x_continuous ( limits= c (0 , 1.25 ) , label= scales:: comma) +
scale_color_discrete () +
geom_vline (xintercept= 1 , color= "gray" , linetype= "solid" )
myplot
---
title: Getting started with GGPLOT
date: "`r Sys.Date()`"
output:
  rmdformats::robobook:
    self_contained: true
    code_download: true
    code_folding: hide
    fig_width: 5
    fig_height: 3
    toc_depth: 3
---



This time I'm using the `robobook` markdown theme, in case you want to copy it. (Download the code to get the stuff you put at the top.) The libraries used are: 

        knitr             tidyverse          janitor         scales
        rmdformats        lubridate          forcats
        


```{r setup, include=FALSE, warning=FALSE} 


library(knitr)
library(rmdformats)

## Global options
options(max.print="75")
opts_chunk$set(echo=TRUE,
               tidy=TRUE,
               message=FALSE,
               warning=FALSE)
opts_knit$set(width=75)
```




```{r get_data, include=FALSE}

# leaving this in for creating the data but I'm not worrying about that for this presentation 

library(tidyverse)
library(lubridate)
library(janitor)
library(forcats) # working with factors
library(scales) # turning numbers into something readable



load ( url ("https://github.com/cronkitedata/rstudyguide/blob/master/data/az_ppp_zipcodes.Rda?raw=true"))

ppp_sums <- 
  az_ppp_zip %>%
  mutate ( lender_grp = fct_lump ( fct_infreq ( lender), 100 )) %>%
  group_by ( census_zip, lender_grp ) %>%
  summarise ( loans = n(),  amount = sum(initial_amt), .groups="drop") 


ppp <- 
az_by_zipcode %>%
  select ( zcta, zipcode_city  , st_county_fip,  zcta_ethnic, usps_businesses ) %>%
  left_join (ppp_sums, by=c("zcta" = "census_zip")) %>%
  mutate ( ethnicity = ordered ( zcta_ethnic) , 
           loans = replace_na ( loans, 0)) %>%
  select ( zcta:st_county_fip, ethnicity, lender_grp, loans, amount, usps_businesses)

levels (ppp$ethnicity) <-c( "No people", "White", "No distinct group", "Black", "Hispanic", "Native American")
levels (ppp$ethnicity)


skimr::skim( ppp)


```


## The data

This data was created from an early version of the Paycheck Protection Program given in Arizona during the pandemic.  Here are the column definitions


variable name   |   type     |   description
--------------- | ---------  | -------------
zcta            | char       | Zip code used in the census
zipcode_city    | char       | Census name associated with that zip code
st_county_fip   | char       | FIP code for the county 
ethnicity       | factor     | County demographic: "No people", "White", "No distinct group", "Black", "Hispanic", "Native American" based on a formula used by the Urban institute to categorize counties. 
lender_grp      | factor     | The top 100 lenders are named and all others lumped into "Other"; the biggest one will come first. 
loans           | num        | # of loans by zip code and lender group 
amount          | num        | $ amount of the loans, according to the initial amount reported, by zip code and lender group
usps_businesses | num        | total number of business addresses in the zip code


A factor works like words, but is always shown in the order you specify, so you don't have to worry about it alphabetizing everything by default. It's particularly useful for charting. 


## A searchable, sortable table

Here's what the first 20 rows look like:

```{r orig_data_list, echo=FALSE}

library(reactable)
ppp %>%
arrange (desc ( loans)) %>%
head(20) %>%
reactable ( 
   defaultPageSize = 10,
   wrap=FALSE,
   searchable = TRUE,
    defaultColDef= 
        colDef (maxWidth=200, format=colFormat(separators=TRUE)),
    theme=reactableTheme( color="gray",
                          style=list (fontFamily = "Work Sans, sans-serif", fontSize="70%")),
    columns = list (
      zcta  = colDef(name="Zip code"), 
      zipcode_city = colDef(name="City"), 
      st_county_fip = colDef (name="FIP"),
      ethnicity  = colDef (name="Ethnicity"),
      lender_grp = colDef (name="Lender", minWidth=200),
      loans = colDef (name = "# loans"),
      amount = colDef (name="$ amount", 
                       minWidth=150, format=colFormat (separators=TRUE, currency="USD", digits=0)),
      usps_businesses = colDef(name= "# businesses"))
  )



```


## Our first chart


One of the first things  a lot of people do with a large dataset is to create a `histogram` of the data. One example is the number of loans by zip code.



### Create a histogram

Alberto Cairo talked about the elements of a data visualization: The container, the encodings and the annotations. The first thing to do is to create the container.  But first, we need to get the data that we want into the form needed for a visualization. We aggregate the lenders to zip code totals, and compute the percentage of businesses that got loans. I'll cap it at 500 % because there were some places with more loans than there were businesses - a result of not having a good list of the eligible businesses for the program. 

```{r build_hist1, class.source= 'fold-show'}

pppdata <- 
  ppp %>%
  group_by ( zcta, ethnicity) %>%
  summarise ( loans = sum(loans) , businesses = first(usps_businesses)) %>%
  mutate ( pct_loans = pmin (loans / businesses, 5)) 

my_histogram <- 
  ggplot ( data= pppdata, 
           mapping = aes ( x = pct_loans ))
  


```

produces: 

```{r print_shell}
my_histogram

```

Nothing! That's because we haven't yet told R what `geometry` to use. The choice here is `geom_histogram()`

#### Add an encoding 


```{r add_geom, class.source='fold-show'}

my_histogram + 
  geom_histogram( )

```




#### Fix bins and colors

This is hard to read, not just because I haven't formatted the percentages (so 5 = 500% of all businesses got loans), but because of the default number of "bins" chosen. Here's how to clean that up a little. At the same time, I'll apply a theme that's a little easier to read, and set the outline (color) and the fill so that the bars aren't smushed together. 

The "[Colors of R](http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf)" document from Columbia University shows you the names of the colors you can use. 


```{r fix_histogram, class.source='fold-show'}

my_histogram +
  geom_histogram ( binwidth=.1, color="white", fill="slategrey") +
  theme_light()


```
All I did was make the bins 10% of each zip code's businesses (0.1)

That's ok, but let's make the labels more accurate:

```{r}

my_histogram +
  geom_histogram ( binwidth=.1, color="white", fill="slategrey") +
  theme_light() +
  scale_x_continuous ( labels=percent) +
  labs  (x= "% of busineses that got loans (maxed at 500%)", y = "# of zip codes")


```



#### Adding complexity

Say we want to see a histogram by the ethnicity of the zip code. We might choose to see a different plot for each one. When we walk through it, I'll go through each of these items one by one. I've started with a blank plot with data for each zip code, showing the number of loans and the ethnicity of each zip code. 


```{r blank_plot_ethnic}

myplot <- 
  pppdata %>%
  filter ( ethnicity %in% c("White", "Hispanic", "Native American")) %>%
  ggplot ( )


```


Building this a little at a time, we can save the plot and keep working on it later. The only thing changed was the "fill" color -- an encoding - to be set by ethnicity of the zip code. 

```{r ethnic_hist , class.source="fold-show"}

myplot <- 
  myplot  + 
  ( aes ( x= pct_loans , fill= ethnicity)) + 
  geom_histogram( binwidth=.1, 
                  color="gray") + 
  scale_x_continuous( labels=percent)


myplot

```


That's not very helpful. Instead, let's try it by putting each ethnicity on its own scale and graphic, using facets. `facet_grid` creates a grid of graphis, with the variable on the left defining the rows and the variable on the right of the tilde as the variable for columns. Use a period to leave either one blank. 

```{r ethnic_hist1 , class.source="fold-show"}

myplot + 
  facet_grid (  ethnicity ~ . )

```


This still isn't that helpful - the scales are the same, making it hard to see the shape. This time, we'll save it and then print it out, meaning we don't have to type anything over again:

```{r ethnic_hist2 , class.source="fold-show"}

myplot <- 
  myplot + 
  facet_grid (  ethnicity ~ . , scales="free_y")


myplot

```


Finally, let's get rid of the legend and use some other colors :

```{r ethnic_hist3 , class.source="fold-show"}

myplot + 
  theme_light () + 
  theme ( legend.position="none") + 
  scale_fill_hue()

```




## A bar chart

Remember that people can see the difference between lenghts of lines better than angles or sizes of circles. That's why you see so many bar charts. Let's take the top 50 lenders and compare the number of loans each has made.  


```{r get_lender_data}

lender_totals <- 
  ppp %>%
  group_by ( lender_grp, ethnicity ) %>%
  summarise ( loans = sum(loans)) %>%
  filter ( as.integer( lender_grp) <=50 )


```


This creates a sideways bar chart, which is easier to read with a lot of categories (lenders) and long names:

```{r r_bar1 , class.source="fold-show", fig.height=9, fig.width=15}

lender_totals %>%
  group_by ( lender_grp) %>%
  summarise (loans = sum(loans)) %>%
  ggplot( aes( x= lender_grp, y= loans)) + 
    geom_bar ( position=position_dodge(), stat="identity")  +
    guides (fill = FALSE) + 
    labs (x = NULL) +
    coord_flip() 



```


One useful thing is to reverse the order of the graphic so the biggest are on top. 


```{r r_bar2 , class.source="fold-show", fig.height=9, fig.width=9}

myplot <- 
  lender_totals %>%
  group_by ( lender_grp) %>%
  summarise (loans = sum(loans)) %>%
  ggplot( aes( x= reorder( lender_grp, loans) , y= loans)) + 
    geom_bar ( position=position_dodge(), stat="identity")  +
    guides (fill = "none") + 
    labs (x = "Company") +
    coord_flip() +
    theme_bw() 

myplot
    
```


## Dot plots


### Percentages by ethnicity 

We'd like to compare the percent of loans within each lender for Hispanic, White and Native American zip codes. This is when we really can use the power of the facets to create a grid tiny little graphics -- one for each lender! 

#### Set up the data


Our goal is to make a series of little dot plots, sometimes called "Cleveland" plots, after the statistical expert that invented the small multiple faceted approach. 

To see a big version, let's make a Cleveland dot plot of the above bar chart instead. This time, the name of the bank is the Y axis and the number of loans is the x axis.

```{r class.source="fold-show", fig.height=12, fig.width=15}


myplot <- 
lender_totals %>%
  group_by ( lender_grp) %>%
  summarise (loans = sum(loans)) %>%
  ggplot( aes( y= reorder( lender_grp, loans) , x= loans)) +
  geom_point (size=4, color="black") +
  theme_bw() +
  theme ( panel.grid.major.x = element_blank(),
          panel.grid.minor.x = element_blank(), 
          panel.grid.major.y = element_line (color = "gray60", linetype="dotted")) +
  labs ( y=NULL) +
  scale_x_continuous ( label=scales::comma)
  


myplot


```


### Facets for ethnicity

That was easy enough, and is even a little easier to read than the bars. Now let's make three of them -- one for each ethnicity. This time, we'll just take the same thing, but start with the data that includes ethnicity


```{r  make_cleveland1 , class.source="fold-show", fig.height=12, fig.width=12}


ppp %>%
  group_by ( lender_grp, ethnicity) %>%
  filter ( as.integer(lender_grp) <= 50, ethnicity %in% c("White", "Native American", "Hispanic")) %>% 
  summarise (loans = sum(loans), .groups="drop") %>%
  ggplot( aes( y= reorder ( lender_grp, desc(lender_grp)  ) , x= loans, color=ethnicity)) +
  geom_point (size=2, color="black") +
  facet_grid ( . ~ ethnicity , scales="free_x") +
  theme_bw() +
  scale_color_brewer () +
  theme ( panel.grid.major.x = element_blank(),
          panel.grid.minor.x = element_blank(), 
          panel.grid.major.y = element_line (color = "gray60", linetype="dotted")) +
  labs ( y=NULL) +
  scale_x_continuous ( label=scales::comma) 
  

```


But this isn't very useful, since it is the raw number of loans for each. Instead, we'll convert the data so that, within each ethnicity, we show the market share (the percent of all loans) that went to each group.


```{r}

totals <- 
  az_ppp_zip %>%
  group_by ( census_zip) %>%
  summarise ( loans = n(), lenders=n_distinct (lender), amount=sum(initial_amt)) %>%
  right_join ( az_by_zipcode, by=c("census_zip"="zcta") ) %>%
  select ( zcta=census_zip, zcta_ethnic,  loans:amount, usps_businesses) %>%
  mutate (ethnicity = ordered (zcta_ethnic), 
          loans = replace_na (loans, 0)) %>%
  group_by (ethnicity) %>%
  summarise ( ethnic_loans = sum(loans) , ethnic_businesses=sum(usps_businesses, na.rm=T), .groups="drop")  %>%
  mutate ( grand_total_loans = sum(ethnic_loans), grand_total_businesses = sum(ethnic_businesses), 
           ethnic_pct = ethnic_loans / grand_total_loans)

levels(totals$ethnicity) <- c("No people", "White", "No distinct group", "Black", "Hispanic", "Native American")

options(scipen=999)

loan_percentages <- 
  ppp %>%
  group_by ( ethnicity, lender_grp ) %>%
  summarise ( loans = sum(loans) , .groups="drop") %>%
  left_join ( totals, by="ethnicity")  %>%
  group_by ( lender_grp ) %>%
  mutate ( lender_pct = loans / sum(loans), 
           lender_ratio = lender_pct / ethnic_pct ) %>%
  filter ( as.integer(lender_grp) <= 50, ethnicity %in% c("White", "Native American", "Hispanic")) %>%
  ungroup() %>%
  select ( ethnicity:ethnic_loans, ethnic_pct, lender_pct, lender_ratio) %>%
  mutate (capped_ratio = pmin (lender_ratio, 1.25), 
          capped_ratio = pmax (capped_ratio, .4))



```


### Make facets by zip code ethnicity


```{r  make_cleveland_data  , fig.height=12, fig.width=12}


loan_percentages %>%
  ggplot( aes( y= reorder ( lender_grp, desc(lender_grp)  ) , x= capped_ratio, color=ethnicity)) +
  geom_point (size=2) +
  facet_grid ( . ~ ethnicity , scales="free_x") +
  theme_bw() +
  theme ( panel.grid.major.x = element_blank(),
          panel.grid.minor.x = element_blank(), 
          panel.grid.major.y = element_line (color = "gray60", linetype="dotted"), 
          legend.position = "none") +
  geom_vline (xintercept=1, color="gray", linetype="solid") +
  labs ( y=NULL, x=NULL) +
  scale_x_continuous ( label=scales::comma) + 
  scale_color_discrete()
  


```



### Make a chart for each bank: 

Here's an example of making a little chart for each bank. The value is the relative percentage of each bank's loans to zip codes by ethnicity, compared with the total for Arizona. Anything higher than 1 means that the bank is disprportionately lending to that group; anything lower than 1 means that they are not lending in those areas at the rate that others do. This is another easy way to pick out banks to focus on, such as Kabbage in Hispanic or Latino areas, and Wells Fargo in Native American areas.



```{r class.source="fold-show", fig.width=9, fig.height=9}


myplot <- 
  loan_percentages %>%
  ggplot( aes(  y=reorder(ethnicity, desc(ethnicity)), x= capped_ratio, color=ethnicity)) +
  geom_point (size=2) +
  facet_wrap (  ~ lender_grp, scales="free_x" ) +
  theme_light () +
  theme ( panel.grid.major.x = element_blank(),
          panel.grid.minor.x = element_blank(), 
          panel.grid.major.y = element_line (color = "gray60", linetype="dotted"), 
          legend.position = "bottom", 
          axis.text.y = element_blank(), 
          axis.text.x = element_blank(), 
          strip.text.x = element_text (hjust=-.01, size=8, color="grey60"),
          strip.background = element_rect (fill=NA, color= NA, linetype = "dotted")) +
  labs ( y=NULL, x=NULL) +
  scale_x_continuous (  limits=c(0, 1.25) , label=scales::comma) + 
  scale_color_discrete() +
  geom_vline (xintercept=1, color="gray", linetype="solid")



myplot

```


