Load packages and data
This is county-level data on the number of kids on Medicaid in Illinois.
library(socsci)
## Loading required package: tidyverse
## -- Attaching packages -------------------------------------------------------------------------------------------------------------------------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.0 v purrr 0.3.3
## v tibble 2.1.3 v dplyr 0.8.4
## v tidyr 1.0.2 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.4.0
## Warning: package 'tidyr' was built under R version 3.6.3
## Warning: package 'dplyr' was built under R version 3.6.3
## -- Conflicts ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## Loading required package: rlang
## Warning: package 'rlang' was built under R version 3.6.3
##
## Attaching package: 'rlang'
## The following objects are masked from 'package:purrr':
##
## %@%, as_function, flatten, flatten_chr, flatten_dbl, flatten_int,
## flatten_lgl, flatten_raw, invoke, list_along, modify, prepend,
## splice
## Loading required package: scales
##
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
##
## discard
## The following object is masked from 'package:readr':
##
## col_factor
## Loading required package: broom
## Warning: package 'broom' was built under R version 3.6.3
## Loading required package: glue
##
## Attaching package: 'glue'
## The following object is masked from 'package:dplyr':
##
## collapse
med <- read_csv("D://medicaid.csv")
## Parsed with column specification:
## cols(
## LocationType = col_character(),
## Location = col_character(),
## TimeFrame = col_character(),
## DataFormat = col_character(),
## Data = col_double()
## )
med
## # A tibble: 1,008 x 5
## LocationType Location TimeFrame DataFormat Data
## <chr> <chr> <chr> <chr> <dbl>
## 1 State Illinois FY 2005 Number 1157980
## 2 State Illinois FY 2006 Number 1214714
## 3 State Illinois FY 2007 Number 1363789
## 4 State Illinois FY 2008 Number 1455172
## 5 State Illinois FY 2009 Number 1553255
## 6 State Illinois FY 2010 Number 1630495
## 7 State Illinois FY 2011 Number 1679232
## 8 State Illinois FY 2012 Number 1697319
## 9 State Illinois FY 2013 Number 1647167
## 10 State Illinois FY 2014 Number 1572082
## # ... with 998 more rows
We need to clean it up. I don’t want the Illinois state-level rows, just the counties, and just FY 2018.
clean <- med %>%
  filter(TimeFrame == "FY 2018") %>%
  filter(Location != "Illinois") %>%
  select(county = Location, med_kid = Data)
clean
## # A tibble: 102 x 2
## county med_kid
## <chr> <dbl>
## 1 Adams 11019
## 2 Alexander 1207
## 3 Bond 1619
## 4 Boone 6464
## 5 Brown 450
## 6 Bureau 3532
## 7 Calhoun 561
## 8 Carroll 1381
## 9 Cass 1991
## 10 Champaign 19873
## # ... with 92 more rows
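Here is a minimal sketch of the same cleanup on a toy table, so it runs without the CSV. The name `med_toy` and the "County" label are assumptions (the printed rows above only show `LocationType == "State"`), and the state totals and the FY 2017 Adams value are made up for illustration. It checks which location types exist first, then puts both conditions in one `filter()` call, which behaves the same as chaining two.

```r
library(dplyr)
library(tibble)

# Toy stand-in for `med`. The "County" label, the state totals, and the
# FY 2017 Adams value are invented for illustration.
med_toy <- tribble(
  ~LocationType, ~Location,  ~TimeFrame, ~DataFormat, ~Data,
  "State",       "Illinois", "FY 2017",  "Number",    1500000,
  "State",       "Illinois", "FY 2018",  "Number",    1450000,
  "County",      "Adams",    "FY 2017",  "Number",    11200,
  "County",      "Adams",    "FY 2018",  "Number",    11019,
  "County",      "Bond",     "FY 2018",  "Number",    1619
)

# See which location types are present before deciding what to drop
med_toy %>% count(LocationType)

# Conditions separated by commas inside filter() are combined with AND
clean_toy <- med_toy %>%
  filter(TimeFrame == "FY 2018", Location != "Illinois") %>%
  select(county = Location, med_kid = Data)

clean_toy
```

If the real file does label county rows consistently, filtering on `LocationType` instead of the location name would be another option, but I have not checked that column here.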
Let’s say I want to see if there’s a relationship between a county’s kids_pct (the percentage of kids in the county, going by the column name) and the number of kids on Medicaid.
kids <- read_csv("D://kids_pct.csv")
## Parsed with column specification:
## cols(
## county = col_character(),
## kids_pct = col_double()
## )
kids
## # A tibble: 102 x 2
## county kids_pct
## <chr> <dbl>
## 1 Adams 22.5
## 2 Alexander 23.1
## 3 Bond 19.4
## 4 Boone 24.8
## 5 Brown 15.9
## 6 Bureau 21.5
## 7 Calhoun 20.6
## 8 Carroll 19.4
## 9 Cass 23.8
## 10 Champaign 18.8
## # ... with 92 more rows
So now I have two datasets that share a variable name: county. I can left join them on that column.
both <- left_join(clean, kids)
## Joining, by = "county"
both
## # A tibble: 102 x 3
## county med_kid kids_pct
## <chr> <dbl> <dbl>
## 1 Adams 11019 22.5
## 2 Alexander 1207 23.1
## 3 Bond 1619 19.4
## 4 Boone 6464 24.8
## 5 Brown 450 15.9
## 6 Bureau 3532 21.5
## 7 Calhoun 561 20.6
## 8 Carroll 1381 19.4
## 9 Cass 1991 23.8
## 10 Champaign 19873 18.8
## # ... with 92 more rows
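The join above lets dplyr guess the key. Here is a small sketch that names the key explicitly and then uses `anti_join()` to list rows that found no partner. The toy tables and the "St. Example" / "Saint Example" spelling mismatch are made up for illustration; they are not from the real files.

```r
library(dplyr)
library(tibble)

# Toy stand-ins for `clean` and `kids`. The third county is hypothetical
# and deliberately spelled differently in the two tables.
clean_toy <- tribble(
  ~county,       ~med_kid,
  "Adams",       11019,
  "Bond",        1619,
  "St. Example", 1000
)
kids_toy <- tribble(
  ~county,         ~kids_pct,
  "Adams",         22.5,
  "Bond",          19.4,
  "Saint Example", 21.0
)

# Naming the key avoids the "Joining, by = ..." message and accidental
# joins on other shared column names
both_toy <- left_join(clean_toy, kids_toy, by = "county")

# Rows in one table with no match in the other
unmatched_left  <- anti_join(clean_toy, kids_toy, by = "county")
unmatched_right <- anti_join(kids_toy, clean_toy, by = "county")

both_toy
unmatched_left
unmatched_right
```

A left join keeps every row of the left table, so an unmatched county still shows up in the result, just with `NA` in the columns that came from the right table.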
Boom, now we have what we need. Let’s do a quick scatter plot.
both %>%
  ggplot(., aes(x = kids_pct, y = med_kid)) +
  geom_point() +
  geom_smooth(method = lm)
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 1 rows containing non-finite values (stat_smooth).
## Warning: Removed 1 rows containing missing values (geom_point).
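Those warnings mean one row has a missing value in a plotted column. Here is a rough sketch of how I might find that row and drop it before plotting, using a toy table. The "Hypothetical" county and its value are invented, and I am not claiming this is why the real data has a missing value.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Toy stand-in for `both`; the last row is made up and has no kids_pct
both_toy <- tribble(
  ~county,        ~med_kid, ~kids_pct,
  "Adams",        11019,    22.5,
  "Bond",         1619,     19.4,
  "Hypothetical", 500,      NA_real_
)

# Which rows would ggplot silently drop?
missing_rows <- both_toy %>%
  filter(is.na(med_kid) | is.na(kids_pct))

# Drop them explicitly so the plot code runs without warnings
plot_ready <- both_toy %>%
  drop_na(med_kid, kids_pct)

missing_rows
plot_ready
```

Dropping the rows yourself makes the decision visible in the code instead of leaving it to a plotting warning.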
Cook County is the obvious outlier there. I would probably remove it, and then I could label the points with ggrepel.
library(ggrepel)
## Warning: package 'ggrepel' was built under R version 3.6.2
both %>%
  filter(county != "Cook") %>%
  ggplot(., aes(x = kids_pct, y = med_kid, label = county)) +
  geom_point() +
  geom_smooth(method = lm) +
  geom_text_repel()
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 1 rows containing non-finite values (stat_smooth).
## Warning: Removed 1 rows containing missing values (geom_point).
## Warning: Removed 1 rows containing missing values (geom_text_repel).
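The line drawn by `geom_smooth(method = lm)` is an ordinary linear regression of `med_kid` on `kids_pct`. As a sketch, the same kind of model can be fit directly to put numbers on the relationship. This uses only the ten counties printed in `both` above (the name `both_head` is mine), so the estimates are illustrative and will not match a fit on all the counties.

```r
# First ten counties exactly as printed in `both` above
both_head <- data.frame(
  county   = c("Adams", "Alexander", "Bond", "Boone", "Brown",
               "Bureau", "Calhoun", "Carroll", "Cass", "Champaign"),
  med_kid  = c(11019, 1207, 1619, 6464, 450, 3532, 561, 1381, 1991, 19873),
  kids_pct = c(22.5, 23.1, 19.4, 24.8, 15.9, 21.5, 20.6, 19.4, 23.8, 18.8)
)

# Same model form that geom_smooth(method = lm) fits: y ~ x
fit <- lm(med_kid ~ kids_pct, data = both_head)
coef(fit)

# Pearson correlation; "complete.obs" skips rows with missing values
r <- cor(both_head$kids_pct, both_head$med_kid, use = "complete.obs")
r
```

On the full data I would run this after filtering out Cook County, the same way the plot does, since a single extreme point can dominate the fit.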