0.1 Note to Students

Dr. Love updated this document on 2020-09-27 to add some good ideas from an initial set of proposal reviews.

This proposal includes a lot of description from Dr. Love about what he’s doing and what’s happening in the R code chunks that should not be included in your proposal (as an example, this whole section shouldn’t be in your proposal.) It also doesn’t include several things that you will need to include in your proposal.

Think of this document as an annotated starting point for thinking about developing your proposal, rather than as a rigid template that just requires you to fill in a few gaps. There is still a lot of work for you to do. Your job in building your proposal requires you to (at a minimum):

  1. adapt the code provided here to address your own decisions and requirements (more than just filling in your title and name, although that’s an important thing to do.)
  2. edit what is provided here so that you wind up only including things that are appropriate for your project
  3. write your own descriptions of the states/measures you’re using and the results you obtain (which Dr. Love has mostly left out of this document.)
  4. knit the R Markdown document into an HTML or PDF report, and then proofread and spell-check all of your work before you submit it.

You should be certain you have a real title and author list in this file.

1 Preliminaries

1.1 My R Packages

library(janitor)
library(magrittr)
library(tidyverse)

Note that I have loaded the tidyverse last, and that I have not loaded any of the tidyverse packages individually. We’ll be checking to see that you’ve done this properly. These are the three packages that Dr. Love has loaded so far in preparing this proposal. The list does not yet include packages (like patchwork and broom, for instance) that he will almost certainly need in his analyses. The final project itself should load every package that gets used.

1.2 Data Ingest

Note that Dr. Love is working here with 2019 data, rather than 2020, as you’ll use. The guess_max argument ensures that read_csv will look through the entire data set (which has fewer than 4000 rows) instead of just the first 1000 rows (which is the default) when guessing the type of each column.

The code below actually loads in the data from County Health Rankings directly, using the 2019 period.

data_url <- "https://www.countyhealthrankings.org/sites/default/files/media/document/analytic_data2019.csv"
chr_2019_raw <- read_csv(data_url, skip = 1, guess_max = 4000)

Note that you’ll need a different data_url (listed below) for the 2020 data.

data_url <- "https://www.countyhealthrankings.org/sites/default/files/media/document/analytic_data2020_0.csv"
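As a minimal sketch of what skip = 1 and guess_max are doing, the chunk below reads a tiny made-up file laid out like the County Health Rankings csv, with a descriptive top row followed by the row of variable names. The file contents, the temporary file and the name toy_raw are illustrative assumptions, not the real data.

```r
library(readr)

# Hypothetical miniature of the CHR layout: a descriptive first row,
# then the row of variable names we actually want, then the data.
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "State Abbreviation,Name,Life expectancy raw value",
  "state,county,v147_rawvalue",
  "CT,Fairfield County,82.6",
  "CT,Hartford County,80.3",
  "DE,Kent County,77.8"
), tmp)

# skip = 1 drops the descriptive row, so the second row supplies the names.
# guess_max caps how many rows readr inspects when guessing column types.
# col_types = cols() still guesses every column, but does so quietly.
toy_raw <- read_csv(tmp, skip = 1, guess_max = 4000, col_types = cols())

names(toy_raw)
```

Setting guess_max to at least the number of rows in the file means that a value appearing only late in a column can still inform the guessed type.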

2 Data Development

2.1 Selecting My Data

I’ll be selecting data from the six “states” (Washington DC, Delaware, Connecticut, Hawaii, New Hampshire and Rhode Island) that are not available to you (because they each have only a few counties: in total there are just 31 counties in those six states.) Note that in your work, you will include Ohio and other states, but none of the states I’ve selected is available to you. Also, you’ll have to describe a reason why you selected your group of states, which I’ll skip here.

I’ve selected five variables (v147, v145, v021, v023 and v139) which I’ll describe shortly. You will make your own choices, of course, and you’ll need to provide more information on each variable in a codebook.

To help you think about the chunk of code below, note that the code below does the following things:

  1. Filter the data to the actual counties that are ranked in the Rankings (this eliminates state and USA totals, mainly.)
  2. Filter to the states we’ve selected (the %in% operator lets us include any state that is in the list we then create with the c() function).
  3. Select the variables that we’re going to use in our study, including the three mandatory variables (fipscode, state and county).
  4. Rename the five variables we’ve selected with more meaningful names. These names are motivated by the actual meaning of the variables, as shown in the top row (that we deleted) in the original csv, the PDF files I’ve included for you, and the more detailed variable descriptions on the County Health Ranking site.
chr_2019 <- chr_2019_raw %>%
    filter(county_ranked == 1) %>%
    filter(state %in% c("DC", "DE", "CT", "HI", "NH", "RI")) %>%
    select(fipscode, state, county, 
           v147_rawvalue, v145_rawvalue, v021_rawvalue, 
           v023_rawvalue, v139_rawvalue) %>%
    rename(life_expectancy = v147_rawvalue,
           freq_mental_distress = v145_rawvalue,
           hsgraduation = v021_rawvalue,
           unemployment = v023_rawvalue,
           food_insecurity = v139_rawvalue)
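
To see the four steps in miniature, here is a hedged sketch on a made-up four-row tibble. The values and the two-state list are invented for illustration, and toy_raw and toy are hypothetical names.

```r
library(dplyr)

# Made-up rows: one state total (county_ranked == 0), two ranked counties
# in a selected state, and one ranked county in a state we did not select.
toy_raw <- tibble(
  fipscode      = c(9000, 9001, 9003, 39035),
  state         = c("CT", "CT", "CT", "OH"),
  county        = c("Connecticut", "Fairfield County",
                    "Hartford County", "Cuyahoga County"),
  county_ranked = c(0, 1, 1, 1),
  v147_rawvalue = c(80.9, 82.6, 80.3, 77.0)
)

toy <- toy_raw %>%
  filter(county_ranked == 1) %>%            # step 1: ranked counties only
  filter(state %in% c("CT", "DE")) %>%      # step 2: selected states only
  select(fipscode, state, county,
         v147_rawvalue) %>%                 # step 3: keep what we need
  rename(life_expectancy = v147_rawvalue)   # step 4: new_name = old_name

toy
```

Note that rename() takes the new name on the left and the existing name on the right, which is easy to reverse by accident.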

2.2 Repairing the fipscode and factoring the state

The fipscode is just a numerical code, and not a meaningful number (so that, for instance, calculating the mean of fipscode would make no sense.) To avoid confusion later, it’s worth it to tell R to treat fipscode as a character variable, rather than a double-precision numeric variable.

But there’s a problem: by reading fipscode in as a numeric variable, R has already dropped any leading zeros. The FIPS code is a 5-digit code which identifies a state (with the first two digits) and then a county (with the remaining three digits), so codes from states whose two-digit code begins with a zero (Connecticut’s 09, for instance) wind up with only four digits.

We can fix this by applying the str_pad function from the stringr package (part of the tidyverse), which will both add a leading zero to any FIPS code with fewer than 5 digits and turn fipscode into a character variable, which is a better choice for a numeric code.

It will also be helpful later to include state as a factor variable, rather than a character.

We can accomplish these two tasks with the following chunk of code.

chr_2019 <- chr_2019 %>%
    mutate(fipscode = str_pad(fipscode, 5, pad = "0"),
           state = factor(state))

You can certainly use as.factor instead of factor here if you like. If you wish to arrange the levels of your state factor in an order other than alphabetically by postal abbreviation (perhaps putting Ohio first or something), then you could do so with fct_relevel(), but I won’t do that here.
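
Here is a small sketch of both ideas on made-up vectors: padding numeric codes back to five characters, and moving one state to the front of a factor’s levels with fct_relevel() from forcats. The object names fips_num, fips_chr and state are illustrative only.

```r
library(stringr)
library(forcats)

# str_pad() pads on the left by default and returns character strings.
fips_num <- c(9001, 10001, 44009)        # numeric, so Connecticut's zero is gone
fips_chr <- str_pad(fips_num, 5, pad = "0")
fips_chr
nchar(fips_chr)                          # every code is now 5 characters long

# fct_relevel() moves the named level(s) to the front;
# the remaining levels keep their existing order.
state <- factor(c("CT", "DE", "OH", "RI"))
levels(state)                            # alphabetical by default
levels(fct_relevel(state, "OH"))
```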

2.2.1 Checking Initial Work

Given the “states” I selected, I should have 31 rows, since there are 31 counties across those states, and I should have 8 variables. It’s also helpful to glimpse through the data and be sure nothing strange has happened in terms of what the first few values look like. Note the leading zeros in fipscode (and that it’s now a character variable) and that state is now a factor, as we’d hoped.

glimpse(chr_2019)
Rows: 31
Columns: 8
$ fipscode             <chr> "09001", "09003", "09005", "09007", "09009", "...
$ state                <fct> CT, CT, CT, CT, CT, CT, CT, CT, DE, DE, DE, DC...
$ county               <chr> "Fairfield County", "Hartford County", "Litchf...
$ life_expectancy      <dbl> 82.57683, 80.31690, 80.69485, 81.06904, 79.981...
$ freq_mental_distress <dbl> 0.09663473, 0.10244227, 0.09852291, 0.09609184...
$ hsgraduation         <dbl> 0.8942186, 0.8551798, 0.9043855, 0.9460923, 0....
$ unemployment         <dbl> 0.04506580, 0.04842678, 0.04308845, 0.04052900...
$ food_insecurity      <dbl> 0.096, 0.119, 0.095, 0.099, 0.124, 0.115, 0.09...

Looks good. I can check to see that each of my states has the anticipated number of counties, too.

chr_2019 %>% tabyl(state) %>% adorn_pct_formatting() 
 state  n percent
    CT  8   25.8%
    DC  1    3.2%
    DE  3    9.7%
    HI  4   12.9%
    NH 10   32.3%
    RI  5   16.1%

OK. These results match up with what I was expecting.

2.3 Creating Binary Categorical Variables

First, I’m going to make a binary categorical variable using the unemployment variable. Note that categorizing a quantitative variable like this is (in practice) a terrible idea, but we’re doing it here so that you can demonstrate some facility with modeling using a categorical variable.

We have numerous options for creating a binary variable.

2.3.1 Splitting into two categories based on the median

chr_2019 <- chr_2019 %>%
    mutate(temp1_ms = case_when(
                   unemployment < median(unemployment) ~ "low",
                   TRUE ~ "high"),
           temp1_ms = factor(temp1_ms))

mosaic::favstats(unemployment ~ temp1_ms, data = chr_2019) %>% 
    kable(digits = 3)
temp1_ms min Q1 median Q3 max mean sd n missing
high 0.039 0.041 0.045 0.049 0.061 0.046 0.005 16 0
low 0.022 0.024 0.026 0.028 0.038 0.027 0.005 15 0
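
One thing to watch if your own variable has missing values (my example has none): median() returns NA unless you set na.rm = TRUE, and the TRUE ~ "high" line catches every row whose comparison comes out as NA. The sketch below uses made-up numbers, and the column names naive and guarded are hypothetical.

```r
library(dplyr)

toy <- tibble(unemployment = c(0.02, 0.03, 0.04, 0.05, NA))

toy <- toy %>%
  mutate(
    # median(unemployment) is NA here, so every comparison is NA
    # and every row falls through to the TRUE branch.
    naive = case_when(
      unemployment < median(unemployment) ~ "low",
      TRUE ~ "high"),
    # Deal with missing values first, and compute the median without them.
    guarded = case_when(
      is.na(unemployment) ~ NA_character_,
      unemployment < median(unemployment, na.rm = TRUE) ~ "low",
      TRUE ~ "high")
  )

toy
```

In the guarded version the missing county stays missing, rather than being quietly placed in the "high" group.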

2.3.2 Splitting into two categories based on a specific value

chr_2019 <- chr_2019 %>%
    mutate(temp2_4pct = case_when(
                   unemployment < 0.04 ~ "below4percent",
                   TRUE ~ "above4percent"),
           temp2_4pct = factor(temp2_4pct))

mosaic::favstats(unemployment ~ temp2_4pct, data = chr_2019) %>% 
    kable(digits = 3)
temp2_4pct min Q1 median Q3 max mean sd n missing
above4percent 0.040 0.042 0.045 0.049 0.061 0.046 0.005 15 0
below4percent 0.022 0.024 0.026 0.028 0.039 0.027 0.005 16 0

2.3.3 Using cut2 from Hmisc to split into two categories as evenly as possible

chr_2019 <- chr_2019 %>%
    mutate(temp3_cut2 = factor(Hmisc::cut2(unemployment, g = 2)))

mosaic::favstats(unemployment ~ temp3_cut2, data = chr_2019) %>% 
    kable(digits = 3)
temp3_cut2 min Q1 median Q3 max mean sd n missing
[0.0218,0.0400) 0.022 0.024 0.026 0.028 0.039 0.027 0.005 16 0
[0.0400,0.0605] 0.040 0.042 0.045 0.049 0.061 0.046 0.005 15 0

This approach is nice in one way, because it specifies the groups with a mathematical interval, but those factor level names can be rather unwieldy in practice. I might tweak them:

chr_2019 <- chr_2019 %>%
    mutate(temp3_cut2 = factor(Hmisc::cut2(unemployment, g = 2)),
           temp4_newnames = fct_recode(temp3_cut2,
                                         lessthan4 = "[0.0218,0.0400)",
                                         higher = "[0.0400,0.0605]"))

mosaic::favstats(unemployment ~ temp4_newnames, data = chr_2019) %>% 
    kable(digits = 3)
temp4_newnames min Q1 median Q3 max mean sd n missing
lessthan4 0.022 0.024 0.026 0.028 0.039 0.027 0.005 16 0
higher 0.040 0.042 0.045 0.049 0.061 0.046 0.005 15 0

2.3.4 Cleaning up

So, I’ve created four different variables here, when I only need one. I’ll go with the median split approach (which I’ll rename unemp_cat) and then drop the other attempts I created from my tibble in this next bit of code. Notice the use of the minus sign (-) before the list of variables I’m dropping in the select statement.

chr_2019 <- chr_2019 %>%
    rename(unemp_cat = temp1_ms) %>%
    select(-c(temp2_4pct, temp3_cut2, temp4_newnames))

Let’s check - we should still have 31 rows, but now we should have 9 columns (variables), since we’ve added the unemp_cat column to the data.

names(chr_2019)
[1] "fipscode"             "state"                "county"              
[4] "life_expectancy"      "freq_mental_distress" "hsgraduation"        
[7] "unemployment"         "food_insecurity"      "unemp_cat"           
nrow(chr_2019)
[1] 31

OK. Still looks fine.

2.4 Creating Multi-Category Variables

Now, I’m going to demonstrate the creation of a multi-category variable based on the hsgraduation variable. I’ll briefly reiterate that categorizing a quantitative variable like this is (in practice) a terrible, no good, very bad idea, but we’re doing it anyway for pedagogical rather than scientific reasons.

2.4.1 Creating a Three-Category Variable

Suppose we want to create three groups of roughly equal size (which, since we have only 31 observations and need to have at least 10 in each group, is really our only choice in my example) and want to use the cut2 function from the Hmisc package.

chr_2019 <- chr_2019 %>%
    mutate(temp3 = factor(Hmisc::cut2(hsgraduation, g = 3)))

mosaic::favstats(hsgraduation ~ temp3, data = chr_2019) %>% 
    kable(digits = 3)
temp3 min Q1 median Q3 max mean sd n missing
[0.724,0.880) 0.724 0.812 0.840 0.852 0.877 0.826 0.042 11 0
[0.880,0.909) 0.880 0.882 0.888 0.894 0.904 0.889 0.009 10 0
[0.909,0.946] 0.909 0.923 0.930 0.936 0.946 0.930 0.011 10 0
chr_2019 <- chr_2019 %>%
    mutate(hsgrad_cat = fct_recode(temp3,
                                   bottom = "[0.724,0.880)",
                                   middle = "[0.880,0.909)",
                                   top = "[0.909,0.946]"))

mosaic::favstats(hsgraduation ~ hsgrad_cat, data = chr_2019) %>% 
    kable(digits = 3)
hsgrad_cat min Q1 median Q3 max mean sd n missing
bottom 0.724 0.812 0.840 0.852 0.877 0.826 0.042 11 0
middle 0.880 0.882 0.888 0.894 0.904 0.889 0.009 10 0
top 0.909 0.923 0.930 0.936 0.946 0.930 0.011 10 0
  1. Note that this same approach (changing g to 4 or 5 as appropriate) could be used to create a 4-category or 5-category variable.
  2. Note also that I used (bottom, middle, top) as the names of my categories instead of, for instance, (low, middle, high).
    • I did this so that R’s default factor sorting (which is alphabetical) would still give me a reasonable order. Otherwise, I’d need to add a fct_relevel step to sort the categories by hand in some reasonable way.
    • Another good trick might have been to precede names that wouldn’t be in the order I want them alphabetically with a number so they sort in a sensible order, perhaps with (1_high, 2_med, 3_low.)
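
As a brief sketch of the sorting issue, here is what happens with (low, middle, high) on a made-up factor, along with the fct_relevel step that repairs the order. The names f and f2 are illustrative.

```r
library(forcats)

f <- factor(c("low", "middle", "high", "low"))
levels(f)                        # alphabetical: "high" "low" "middle"

# Reorder by hand so that tables and plots run low, then middle, then high.
f2 <- fct_relevel(f, "low", "middle", "high")
levels(f2)
```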

2.4.2 Creating a 5-Category variable with Specified Cutpoints

Suppose we want to split our hsgraduation data so that we have five categories, based on the cutpoints (0.8, 0.85, 0.9 and 0.92). These four cutpoints will produce five mutually exclusive (no county can be in more than one category) and collectively exhaustive (every county is assigned to a category) categories:

  1. hsgraduation rate below 0.80,
  2. 0.80 up to but not including 0.85,
  3. 0.85 up to but not including 0.90,
  4. 0.90 up to but not including 0.92, and
  5. hsgraduation rate of 0.92 or more
chr_2019 <- chr_2019 %>%
    mutate(temp4 = case_when(
        hsgraduation < 0.8 ~ "1_lowest",
        hsgraduation < 0.85 ~ "2_low",
        hsgraduation < 0.9 ~ "3_middle",
        hsgraduation < 0.92 ~ "4_high",
        TRUE ~ "5_highest"),
        temp4 = factor(temp4))

mosaic::favstats(hsgraduation ~ temp4, data = chr_2019) %>% 
    kable(digits = 3)
temp4 min Q1 median Q3 max mean sd n missing
1_lowest 0.724 0.740 0.756 0.773 0.789 0.756 0.046 2 0
2_low 0.801 0.826 0.837 0.842 0.848 0.831 0.017 6 0
3_middle 0.855 0.878 0.882 0.888 0.894 0.880 0.013 11 0
4_high 0.901 0.903 0.907 0.912 0.920 0.909 0.008 4 0
5_highest 0.923 0.928 0.931 0.939 0.946 0.933 0.009 8 0
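
The case_when approach works because the conditions are checked in order, so each county lands in the first category whose condition it meets. A minimal sketch with made-up values placed on or just beside the cutpoints (toy and cat5 are hypothetical names):

```r
library(dplyr)

# Values chosen to sit on or next to each cutpoint.
toy <- tibble(hsgraduation = c(0.799, 0.80, 0.849, 0.85, 0.90, 0.92)) %>%
  mutate(cat5 = case_when(
    hsgraduation < 0.80 ~ "1_lowest",
    hsgraduation < 0.85 ~ "2_low",      # reached only if not below 0.80
    hsgraduation < 0.90 ~ "3_middle",
    hsgraduation < 0.92 ~ "4_high",
    TRUE ~ "5_highest"))                # everything at 0.92 or above

toy
```

A value exactly equal to a cutpoint goes into the higher category, which matches the “up to but not including” wording in the list of categories.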

I’ll just note that it is also possible to set cutpoints with Hmisc::cut2.
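
A hedged sketch of that alternative, using the cuts argument of Hmisc::cut2 on ten made-up graduation rates (two per intended category). The vector x and the name x_cat are illustrative, and the level labels cut2 prints may differ from the names you would choose yourself.

```r
# Made-up graduation rates, two per intended category.
x <- c(0.72, 0.79, 0.80, 0.84, 0.85, 0.89, 0.90, 0.91, 0.92, 0.95)

# cuts supplies the interior cutpoints; cut2 forms intervals that are
# closed on the left, so 0.80 belongs to the second group, not the first.
x_cat <- Hmisc::cut2(x, cuts = c(0.80, 0.85, 0.90, 0.92))

table(x_cat)
```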

2.4.3 Cleaning up

So, I’ve created two multi-categorical variables, but I will just retain the 3-category version (which I called hsgrad_cat) and drop the other temporary efforts.

chr_2019 <- chr_2019 %>%
    select(-c(temp3, temp4))

2.5 Structure of My Tibble

Next, I’ll print the structure of my tibble. I’m checking to see that:

  • the initial row tells me that this is a tibble and specifies its dimensions
  • I still have the complete set of 31 rows (counties)
  • I’ve included only 10 variables:
    • the three required variables fipscode, county and state, where I’ll also check that fipscode and county are character (chr) variables, and that state is a factor variable (Factor), with an appropriate number of levels
    • my original five selected variables, properly renamed and all of numerical (num) type (this may also be specified as double-precision or dbl, which is fine)
    • my two categorical variables unemp_cat and hsgrad_cat, which should each be factors with appropriate levels specified, followed by numerical codes
str(chr_2019)
tibble [31 x 10] (S3: tbl_df/tbl/data.frame)
 $ fipscode            : chr [1:31] "09001" "09003" "09005" "09007" ...
 $ state               : Factor w/ 6 levels "CT","DC","DE",..: 1 1 1 1 1 1 1 1 3 3 ...
 $ county              : chr [1:31] "Fairfield County" "Hartford County" "Litchfield County" "Middlesex County" ...
 $ life_expectancy     : num [1:31] 82.6 80.3 80.7 81.1 80 ...
 $ freq_mental_distress: num [1:31] 0.0966 0.1024 0.0985 0.0961 0.1109 ...
 $ hsgraduation        : num [1:31] 0.894 0.855 0.904 0.946 0.834 ...
 $ unemployment        : num [1:31] 0.0451 0.0484 0.0431 0.0405 0.0503 ...
 $ food_insecurity     : num [1:31] 0.096 0.119 0.095 0.099 0.124 0.115 0.099 0.111 0.13 0.115 ...
 $ unemp_cat           : Factor w/ 2 levels "high","low": 1 1 1 1 1 1 1 1 1 1 ...
 $ hsgrad_cat          : Factor w/ 3 levels "bottom","middle",..: 2 1 2 3 1 2 3 1 2 1 ...

Looks good so far. I think we are ready to go.
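
If you would rather have R stop when something is off than rely on reading the str output by eye, one option is a set of stopifnot checks. This is only a sketch on a made-up two-row stand-in (toy) with a few of the same column types as my tibble, not a required step.

```r
library(tibble)

# Stand-in with some of the same column types as chr_2019 (values made up).
toy <- tibble(
  fipscode        = c("09001", "10001"),
  state           = factor(c("CT", "DE")),
  county          = c("Fairfield County", "Kent County"),
  life_expectancy = c(82.6, 77.8),
  unemp_cat       = factor(c("high", "low"))
)

# stopifnot() is silent when every condition holds, and errors otherwise.
stopifnot(
  is_tibble(toy),
  is.character(toy$fipscode),
  all(nchar(toy$fipscode) == 5),
  is.factor(toy$state),
  is.numeric(toy$life_expectancy),
  is.factor(toy$unemp_cat)
)

sapply(toy, class)
```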

3 Codebook

This is a table listing all 10 variables that are included in your tibble, and providing some important information about them, mostly drawn from the County Health Rankings web site. For each of your five selected variables, be sure to include the original code (vXXX) from the raw file.

Variable               Description
fipscode               FIPS code
state                  State: my six states are CT, DC, DE, HI, NH, RI
county                 County Name
life_expectancy        (v147) Life Expectancy, which will be my outcome
freq_mental_distress   (v145) Frequent Mental Distress Rate
hsgraduation           (v021) High School Graduation Rate
unemployment           (v023) Unemployment Rate
food_insecurity        (v139) Food Insecurity Rate
unemp_cat              2 levels: low = unemployment below 3.9%, or high
hsgrad_cat             3 levels: bottom = hsgraduation below 88%, middle or top = 90.9% or above

Note that I’ve provided details on the definition of our categorical variables.

More details on two of our original five variables are specified below. These results are rephrased versions of the summaries linked on the County Health Rankings site. You’ll need to provide information of this type as part of the codebook for all five of your selected variables.

  • life_expectancy was originally variable v147_rawvalue, and is listed in the Length of Life subcategory under Health Outcomes at County Health Rankings. It describes the average number of years a person residing in the county can expect to live, according to the current mortality experience (age-specific death rates) of the county’s population. It is based on data from the National Center for Health Statistics Mortality Files from 2016-18. This will be my outcome variable.

  • hsgraduation was originally variable v021_rawvalue, and is listed in the Education subcategory under Social & Economic Factors at County Health Rankings. It describes the proportion of the county’s ninth grade cohort that graduates with a high school diploma in four years, and is based on EDFacts data from 2016-17. Comparisons across state lines are not recommended because of differences in how states define the data, according to County Health Rankings.

3.1 Proposal Requirement 1

Remember that you will need to do five things in the proposal.

  1. A sentence or two (perhaps accompanied by a small table of R results) specifying the 4-6 states you chose, and the number of counties you are studying in total and within each state. In an additional sentence or two, provide some motivation for why you chose those states.

3.2 Proposal Requirement 2

  2. A list of the five variables (including their original raw names and your renamed versions) you are studying, with a clear indication of the cutpoints you chose to create the binary categories out of variable 4 and the multiple categories out of variable 5. Think of this as an early version of what will eventually become your codebook. For each variable, provide a sentence describing your motivation for why this variable was interesting to you, and also please specify which of your quantitative variables will serve as your outcome.

3.3 Proposal Requirement 3

Print the tibble, so we can verify that it is, in fact, a tibble (printing a tibble shows only the first 10 rows).

chr_2019
# A tibble: 31 x 10
   fipscode state county life_expectancy freq_mental_dis~ hsgraduation
   <chr>    <fct> <chr>            <dbl>            <dbl>        <dbl>
 1 09001    CT    Fairf~            82.6           0.0966        0.894
 2 09003    CT    Hartf~            80.3           0.102         0.855
 3 09005    CT    Litch~            80.7           0.0985        0.904
 4 09007    CT    Middl~            81.1           0.0961        0.946
 5 09009    CT    New H~            80.0           0.111         0.834
 6 09011    CT    New L~            79.9           0.109         0.890
 7 09013    CT    Tolla~            81.8           0.100         0.923
 8 09015    CT    Windh~            78.8           0.114         0.848
 9 10001    DE    Kent ~            77.8           0.117         0.901
10 10003    DE    New C~            78.7           0.110         0.877
# ... with 21 more rows, and 4 more variables: unemployment <dbl>,
#   food_insecurity <dbl>, unemp_cat <fct>, hsgrad_cat <fct>

3.4 Proposal Requirement 4

To meet proposal requirement 4, run describe from the Hmisc package.

Hmisc::describe(chr_2019)
chr_2019 

 10  Variables      31  Observations
--------------------------------------------------------------------------------
fipscode 
       n  missing distinct 
      31        0       31 

lowest : 09001 09003 09005 09007 09009, highest: 44001 44003 44005 44007 44009
--------------------------------------------------------------------------------
state 
       n  missing distinct 
      31        0        6 

lowest : CT DC DE HI NH, highest: DC DE HI NH RI
                                              
Value         CT    DC    DE    HI    NH    RI
Frequency      8     1     3     4    10     5
Proportion 0.258 0.032 0.097 0.129 0.323 0.161
--------------------------------------------------------------------------------
county 
       n  missing distinct 
      31        0       30 

lowest : Belknap County    Bristol County    Carroll County    Cheshire County   Coos County      
highest: Sullivan County   Sussex County     Tolland County    Washington County Windham County   
--------------------------------------------------------------------------------
life_expectancy 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
      31        0       31        1    79.99    1.719    78.04    78.34 
     .25      .50      .75      .90      .95 
   78.98    79.92    81.02    81.99    82.56 

lowest : 76.76078 77.80577 78.27487 78.34321 78.41546
highest: 81.80405 81.99087 82.53515 82.57683 82.66931
--------------------------------------------------------------------------------
freq_mental_distress 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
      31        0       31        1   0.1102  0.01115  0.09484  0.09663 
     .25      .50      .75      .90      .95 
 0.10358  0.11089  0.11691  0.12052  0.12094 

lowest : 0.08533063 0.09358767 0.09609184 0.09663473 0.09852291
highest: 0.11844904 0.12052429 0.12054963 0.12133815 0.13272832
--------------------------------------------------------------------------------
hsgraduation 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
      31        0       31        1   0.8799  0.05573   0.7950   0.8235 
     .25      .50      .75      .90      .95 
  0.8517   0.8854   0.9214   0.9319   0.9413 

lowest : 0.7236731 0.7892854 0.8007369 0.8235494 0.8338990
highest: 0.9302662 0.9318966 0.9370460 0.9455388 0.9460923
--------------------------------------------------------------------------------
unemployment 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
      31        0       31        1  0.03647  0.01256  0.02219  0.02353 
     .25      .50      .75      .90      .95 
 0.02608  0.03922  0.04503  0.04974  0.05024 

lowest : 0.02181668 0.02198898 0.02239195 0.02353413 0.02358351
highest: 0.04842678 0.04973634 0.05018781 0.05028742 0.06051724
--------------------------------------------------------------------------------
food_insecurity 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
      31        0       21    0.994   0.1058  0.01585   0.0895   0.0900 
     .25      .50      .75      .90      .95 
  0.0975   0.1040   0.1150   0.1240   0.1290 

lowest : 0.074 0.089 0.090 0.093 0.095, highest: 0.119 0.124 0.128 0.130 0.132
--------------------------------------------------------------------------------
unemp_cat 
       n  missing distinct 
      31        0        2 
                      
Value       high   low
Frequency     16    15
Proportion 0.516 0.484
--------------------------------------------------------------------------------
hsgrad_cat 
       n  missing distinct 
      31        0        3 
                               
Value      bottom middle    top
Frequency      11     10     10
Proportion  0.355  0.323  0.323
--------------------------------------------------------------------------------

3.5 Three Important Checks

There are three important things I have to demonstrate, as described in Tasks C (Identify Your Variables) and D (Create Categorical Variables) in our Data Development work. They are:

  • Each of the five variables you select must have data for at least 75% of the counties in each state you plan to study.

Do we have any missing data here?

chr_2019 %>% 
    summarize(across(life_expectancy:food_insecurity, ~ sum(is.na(.))))
# A tibble: 1 x 5
  life_expectancy freq_mental_distress hsgraduation unemployment food_insecurity
            <int>                <int>        <int>        <int>           <int>
1               0                    0            0            0               0

Nope, so we’re OK!

If I did have some missingness, then I would probably want to summarize this by state, so that I could compare the results. Here’s a way to look at this just for the life_expectancy variable.

mosaic::favstats(life_expectancy ~ state, data = chr_2019) %>%
    select(state, n, missing) %>%
    # favstats reports n as the count of non-missing values
    mutate(pct_available = 100*n/(n + missing)) %>%
    kable()
state n missing pct_available
CT 8 0 100
DC 1 0 100
DE 3 0 100
HI 4 0 100
NH 10 0 100
RI 5 0 100

We’re OK, because 100% of the data are available. In my example, this is true for all five of the variables I used. In yours, that may or may not be the case. Remember that all of your selected variables need to be available in at least 75% of the counties in EACH state you study.
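
To check all of your selected variables within each state in one step, something like the following sketch may help. It uses a made-up tibble with deliberate gaps, and toy and avail are hypothetical names; in your work the column range would run across your own five variables.

```r
library(dplyr)

# Made-up data with one missing value in each variable.
toy <- tibble(
  state           = c("CT", "CT", "CT", "CT", "DE", "DE"),
  life_expectancy = c(82.6, 80.3, NA, 81.1, 77.8, 78.7),
  unemployment    = c(0.045, 0.048, 0.043, 0.041, NA, 0.040)
)

# mean(!is.na(x)) is the proportion of non-missing values, computed
# separately within each state because of group_by().
avail <- toy %>%
  group_by(state) %>%
  summarize(across(life_expectancy:unemployment,
                   ~ 100 * mean(!is.na(.x))))

avail
```

Here Connecticut has 75% of its life_expectancy values and Delaware has only 50% of its unemployment values, so the made-up Delaware data would fail the 75% requirement.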

  • The raw versions of each of your five selected variables must have at least 10 distinct non-missing values.
chr_2019 %>% 
    summarize(across(life_expectancy:food_insecurity, ~ n_distinct(., na.rm = TRUE)))
# A tibble: 1 x 5
  life_expectancy freq_mental_distress hsgraduation unemployment food_insecurity
            <int>                <int>        <int>        <int>           <int>
1              31                   31           31           31              21

OK. We’re fine there.
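
One caution if your data include missing values: by default n_distinct() counts NA as one of the distinct values, while the requirement is about non-missing values. A tiny sketch on a made-up vector:

```r
library(dplyr)

x <- c(0.10, 0.12, 0.12, NA)

n_distinct(x)                  # NA is counted as a value here
n_distinct(x, na.rm = TRUE)    # non-missing values only
```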

  • For each of the categorical variables you create, every level of the resulting factor must include at least 10 counties.
chr_2019 %>% tabyl(unemp_cat)
 unemp_cat  n  percent
      high 16 0.516129
       low 15 0.483871
chr_2019 %>% tabyl(hsgrad_cat)
 hsgrad_cat  n   percent
     bottom 11 0.3548387
     middle 10 0.3225806
        top 10 0.3225806

OK. I have at least 10 counties in each category for each of the categorical variables that I created.
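
If you want R to confirm the 10-county minimum rather than reading it off the table, a short sketch like this one would do it. The factor below is rebuilt from the counts shown above (11, 10 and 10) purely for illustration.

```r
# Rebuild a factor with the same level counts as hsgrad_cat above.
hsgrad_cat <- factor(rep(c("bottom", "middle", "top"), times = c(11, 10, 10)))

counts <- table(hsgrad_cat)
counts

# TRUE only if every level has at least 10 counties.
all(counts >= 10)
```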

3.6 Saving the Tibble

Finally, we’ll save this tibble as an R data set into the same location as our original data set within our R Project directory.

saveRDS(chr_2019, file = "chr_2019_Thomas_Love.Rds")

You’ll want to substitute in your own name, of course.
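
As a quick sketch of why an .Rds file is handy here: readRDS() brings the object back with its column types intact, so fipscode stays a character variable and state stays a factor. The toy data frame and the temporary file path below are illustrative assumptions.

```r
# Small stand-in object (values made up).
toy <- data.frame(fipscode = c("09001", "10001"),
                  state = factor(c("CT", "DE")),
                  stringsAsFactors = FALSE)

# Save to a temporary location and read it back in.
path <- file.path(tempdir(), "chr_toy_Your_Name.Rds")
saveRDS(toy, file = path)
toy_back <- readRDS(path)

identical(toy, toy_back)   # the round trip preserves the object
str(toy_back)
```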

3.7 Proposal Requirement 5

Having done all of this work, the Proposal Requirements should be straightforward. We’ve already dealt with the first four. The fifth is repeated below.

  5. In a paragraph, describe the most challenging (or difficult) part of completing the work so far, and how you were able to overcome whatever it was that was difficult.

OK. That’s your job.

4 Analyses

This isn’t part of the proposal.

5 Session Information

sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] ggridges_0.5.2    mosaicData_0.20.1 ggformula_0.9.4   ggstance_0.3.4   
 [5] Matrix_1.2-18     lattice_0.20-41   forcats_0.5.0     stringr_1.4.0    
 [9] dplyr_1.0.2       purrr_0.3.4       readr_1.3.1       tidyr_1.1.2      
[13] tibble_3.0.3      ggplot2_3.3.2     tidyverse_1.3.0   magrittr_1.5     
[17] janitor_2.0.1     rmdformats_0.3.7  knitr_1.29       

loaded via a namespace (and not attached):
 [1] fs_1.5.0            lubridate_1.7.9     RColorBrewer_1.1-2 
 [4] httr_1.4.2          tools_4.0.2         backports_1.1.10   
 [7] utf8_1.1.4          R6_2.4.1            rpart_4.1-15       
[10] Hmisc_4.4-1         DBI_1.1.0           colorspace_1.4-1   
[13] nnet_7.3-14         withr_2.2.0         tidyselect_1.1.0   
[16] gridExtra_2.3       leaflet_2.0.3       curl_4.3           
[19] compiler_4.0.2      cli_2.0.2           rvest_0.3.6        
[22] htmlTable_2.1.0     xml2_1.3.2          ggdendro_0.1.22    
[25] bookdown_0.20       checkmate_2.0.0     mosaicCore_0.8.0   
[28] scales_1.1.1        digest_0.6.25       foreign_0.8-80     
[31] rmarkdown_2.3.3     jpeg_0.1-8.1        base64enc_0.1-3    
[34] pkgconfig_2.0.3     htmltools_0.5.0     dbplyr_1.4.4       
[37] highr_0.8           htmlwidgets_1.5.1   rlang_0.4.7        
[40] readxl_1.3.1        rstudioapi_0.11     farver_2.0.3       
[43] generics_0.0.2      jsonlite_1.7.1      crosstalk_1.1.0.1  
[46] Formula_1.2-3       Rcpp_1.0.5          munsell_0.5.0      
[49] fansi_0.4.1         lifecycle_0.2.0     stringi_1.5.3      
[52] yaml_2.2.1          snakecase_0.11.0    MASS_7.3-53        
[55] plyr_1.8.6          grid_4.0.2          blob_1.2.1         
[58] ggrepel_0.8.2       crayon_1.3.4        haven_2.3.1        
[61] splines_4.0.2       hms_0.5.3           pillar_1.4.6       
[64] reprex_0.3.0        glue_1.4.2          evaluate_0.14      
[67] latticeExtra_0.6-29 data.table_1.13.0   modelr_0.1.8       
[70] png_0.1-7           vctrs_0.3.4         tweenr_1.0.1       
[73] cellranger_1.1.0    gtable_0.3.0        polyclip_1.10-0    
 [ reached getOption("max.print") -- omitted 8 entries ]