Left Joins

Load packages and data

This is county-level data on the number of kids on Medicaid in Illinois.

library(socsci)

## Loading required package: tidyverse

## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --

## v ggplot2 3.3.0     v purrr   0.3.3
## v tibble  2.1.3     v dplyr   0.8.4
## v tidyr   1.0.2     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0

## Warning: package 'tidyr' was built under R version 3.6.3

## Warning: package 'dplyr' was built under R version 3.6.3

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

## Loading required package: rlang

## Warning: package 'rlang' was built under R version 3.6.3

## 
## Attaching package: 'rlang'

## The following objects are masked from 'package:purrr':
## 
##     %@%, as_function, flatten, flatten_chr, flatten_dbl, flatten_int,
##     flatten_lgl, flatten_raw, invoke, list_along, modify, prepend,
##     splice

## Loading required package: scales

## 
## Attaching package: 'scales'

## The following object is masked from 'package:purrr':
## 
##     discard

## The following object is masked from 'package:readr':
## 
##     col_factor

## Loading required package: broom

## Warning: package 'broom' was built under R version 3.6.3

## Loading required package: glue

## 
## Attaching package: 'glue'

## The following object is masked from 'package:dplyr':
## 
##     collapse

med <- read_csv("D://medicaid.csv")

## Parsed with column specification:
## cols(
##   LocationType = col_character(),
##   Location = col_character(),
##   TimeFrame = col_character(),
##   DataFormat = col_character(),
##   Data = col_double()
## )

med

## # A tibble: 1,008 x 5
##    LocationType Location TimeFrame DataFormat    Data
##    <chr>        <chr>    <chr>     <chr>        <dbl>
##  1 State        Illinois FY 2005   Number     1157980
##  2 State        Illinois FY 2006   Number     1214714
##  3 State        Illinois FY 2007   Number     1363789
##  4 State        Illinois FY 2008   Number     1455172
##  5 State        Illinois FY 2009   Number     1553255
##  6 State        Illinois FY 2010   Number     1630495
##  7 State        Illinois FY 2011   Number     1679232
##  8 State        Illinois FY 2012   Number     1697319
##  9 State        Illinois FY 2013   Number     1647167
## 10 State        Illinois FY 2014   Number     1572082
## # ... with 998 more rows

We need to clean it up: I don’t want the state-level Illinois rows, just the counties, and only fiscal year 2018.

clean <- med %>% 
  filter(TimeFrame == "FY 2018") %>% 
  filter(Location != "Illinois") %>% 
  select(county = Location, med_kid = Data)
clean

## # A tibble: 102 x 2
##    county    med_kid
##    <chr>       <dbl>
##  1 Adams       11019
##  2 Alexander    1207
##  3 Bond         1619
##  4 Boone        6464
##  5 Brown         450
##  6 Bureau       3532
##  7 Calhoun       561
##  8 Carroll      1381
##  9 Cass         1991
## 10 Champaign   19873
## # ... with 92 more rows
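
The two filter() calls above can also be written as one call with two conditions. Here is a minimal sketch of that, assuming a made-up table called toy_med that only mimics the columns of med (the rows and numbers are invented for illustration, not taken from the Medicaid file):

```r
library(dplyr)

# Hypothetical stand-in for `med`; every value here is invented.
toy_med <- tibble(
  LocationType = c("State", "County", "County", "County"),
  Location     = c("Illinois", "Adams", "Bond", "Adams"),
  TimeFrame    = c("FY 2018", "FY 2018", "FY 2018", "FY 2017"),
  DataFormat   = "Number",
  Data         = c(100, 10, 20, 9)
)

# Conditions separated by a comma inside filter() are combined with AND,
# so this keeps FY 2018 rows that are not the statewide Illinois row.
toy_clean <- toy_med %>%
  filter(TimeFrame == "FY 2018", Location != "Illinois") %>%
  select(county = Location, med_kid = Data)  # select() can rename as it picks

toy_clean
```

Either style gives the same rows; chaining separate filter() calls just makes each step easier to comment out while exploring.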

Let’s say I want to see if there’s a relationship between the percentage of kids in a county and the number of kids on Medicaid.

kids <- read_csv("D://kids_pct.csv")

## Parsed with column specification:
## cols(
##   county = col_character(),
##   kids_pct = col_double()
## )

kids

## # A tibble: 102 x 2
##    county    kids_pct
##    <chr>        <dbl>
##  1 Adams         22.5
##  2 Alexander     23.1
##  3 Bond          19.4
##  4 Boone         24.8
##  5 Brown         15.9
##  6 Bureau        21.5
##  7 Calhoun       20.6
##  8 Carroll       19.4
##  9 Cass          23.8
## 10 Champaign     18.8
## # ... with 92 more rows

So now I have two datasets that share a column with the same name: county. I can left join them on that column, which keeps every row of clean and adds the matching kids_pct value from kids.

both <- left_join(clean, kids)

## Joining, by = "county"

both

## # A tibble: 102 x 3
##    county    med_kid kids_pct
##    <chr>       <dbl>    <dbl>
##  1 Adams       11019     22.5
##  2 Alexander    1207     23.1
##  3 Bond         1619     19.4
##  4 Boone        6464     24.8
##  5 Brown         450     15.9
##  6 Bureau       3532     21.5
##  7 Calhoun       561     20.6
##  8 Carroll      1381     19.4
##  9 Cass         1991     23.8
## 10 Champaign   19873     18.8
## # ... with 92 more rows
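
A left join only matches rows whose key values are exactly equal, so a spelling difference in a county name would leave an NA behind. The sketch below shows one way to check for that, using two invented toy tables (toy_clean and toy_kids are hypothetical, and the "DeWitt" versus "De Witt" mismatch is made up for the example). It also names the key with by = "county" instead of letting dplyr guess it.

```r
library(dplyr)

# Hypothetical toy tables; the county spellings and numbers are invented.
toy_clean <- tibble(
  county  = c("Adams", "Bond", "DeWitt"),
  med_kid = c(10, 20, 30)
)
toy_kids <- tibble(
  county   = c("Adams", "Bond", "De Witt"),
  kids_pct = c(22.5, 19.4, 21.0)
)

# Naming the key explicitly documents the join and silences the
# "Joining, by = ..." message.
toy_both <- left_join(toy_clean, toy_kids, by = "county")

# anti_join() returns the rows of the left table with no match on the right,
# which are exactly the rows that end up with NA after the left join.
unmatched <- anti_join(toy_clean, toy_kids, by = "county")

toy_both
unmatched
```

If the anti_join() comes back with zero rows, every county in the left table found a partner.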

Boom, now we have what we need. Let’s do a quick scatter plot.

both %>% 
  ggplot(., aes(x = kids_pct, y = med_kid)) +
  geom_point() +
  geom_smooth(method = lm)

## `geom_smooth()` using formula 'y ~ x'

## Warning: Removed 1 rows containing non-finite values (stat_smooth).

## Warning: Removed 1 rows containing missing values (geom_point).
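
Those warnings mean one row has a missing value in a plotted column, so ggplot2 dropped it. A quick way to find a row like that is to filter on is.na(). This is a sketch on an invented table (toy_both is hypothetical and its NA is placed by hand); the same filter() line should work on both, but I have not shown its output here.

```r
library(dplyr)

# Hypothetical joined table with one missing value placed by hand.
toy_both <- tibble(
  county   = c("Adams", "Bond", "Cass"),
  med_kid  = c(10, NA, 30),
  kids_pct = c(22.5, 19.4, 23.8)
)

# Keep only the rows where either plotted column is missing.
missing_rows <- toy_both %>%
  filter(is.na(med_kid) | is.na(kids_pct))

missing_rows
```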

Cook County is obviously the outlier there. I would probably want to remove it, and then I could label the points with ggrepel.

library(ggrepel)

## Warning: package 'ggrepel' was built under R version 3.6.2

both %>%
  filter(county != "Cook") %>% 
  ggplot(., aes(x = kids_pct, y = med_kid, label = county)) +
  geom_point() +
  geom_smooth(method = lm) +
  geom_text_repel()

## `geom_smooth()` using formula 'y ~ x'

## Warning: Removed 1 rows containing non-finite values (stat_smooth).

## Warning: Removed 1 rows containing missing values (geom_point).

## Warning: Removed 1 rows containing missing values (geom_text_repel).
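
The line drawn by geom_smooth(method = lm) comes from an ordinary linear regression of y on x, so the same fit can be pulled out with lm() if you want the numbers behind the line. A minimal sketch, assuming an invented data frame toy_both (the values are made up and chosen to be perfectly linear so the result is easy to read):

```r
# Hypothetical data; the values are invented and perfectly linear.
toy_both <- data.frame(
  county   = c("A", "B", "C"),
  kids_pct = c(10, 20, 30),
  med_kid  = c(100, 200, 300)
)

# Same formula direction as the plot: y (med_kid) regressed on x (kids_pct).
fit <- lm(med_kid ~ kids_pct, data = toy_both)

# Intercept and slope of the fitted line.
coef(fit)
```

On real data, lm() drops rows with missing values by default, which lines up with the rows ggplot2 warns about removing.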

Ryan Burge

4/23/2020