0.1 Note to Students

Dr. Love updated this document on 2020-10-08 to add an outline for the Analysis section.

This proposal includes a lot of description from Dr. Love about what he’s doing and what’s happening in the R code chunks that should not be included in your proposal (as an example, this whole section shouldn’t be in your proposal.) It also doesn’t include several things that you will need to include in your proposal.

Think of this document as an annotated starting point for thinking about developing your proposal, rather than as a rigid template that just requires you to fill in a few gaps. There is still a lot of work for you to do. Your job in building your proposal requires you to (at a minimum):

  1. adapt the code provided here to address your own decisions and requirements (more than just filling in your title and name, although that’s an important thing to do.)
  2. edit what is provided here so that you wind up only including things that are appropriate for your project
  3. write your own descriptions of the states/measures you’re using and the results you obtain (which Dr. Love has mostly left out of this document.)
  4. knit the R Markdown document into an HTML or PDF report, and then proofreading and spell-checking all of your work before you submit it.

You should be certain you have a real title and author list in this file.

1 Preliminaries

1.1 My R Packages

library(janitor)
library(magrittr)
library(tidyverse)

Note that I have loaded the tidyverse last, and that I have not loaded any of the tidyverse packages individually. We’ll be checking to see that you’ve done this properly. These are the three packages that Dr. Love has used in preparing this proposal, and don’t include packages (like patchwork and broom, for instance) that he almost certainly would need to use in his analyses, yet. The final project itself should include all packages that get used.

1.2 Data Ingest

Note that Dr. Love is working here with 2019 data, rather than 2020, as you’ll use. The guess_max result ensures that read_csv will look through the entire data set (which has less than 4000 rows) instead of just the first 1000 rows (which is the default.)

The code below actually loads in the data from County Health Rankings directly, using the 2019 period.

data_url <- "https://www.countyhealthrankings.org/sites/default/files/media/document/analytic_data2019.csv"
chr_2019_raw <- read_csv(data_url, skip = 1, guess_max = 4000)

Note that you’ll need a different data_url (listed below) for the 2020 data.

data_url <- "https://www.countyhealthrankings.org/sites/default/files/media/document/analytic_data2020_0.csv"

2 Data Development

2.1 Selecting My Data

I’ll be selecting data from the six “states” (Washington DC, Delaware, Connecticut, Hawaii, New Hampshire and Rhode Island) that are not available to you (because they each have only a few counties: in total there are just 31 counties in those six states.) Note that in your work, you will include Ohio, and other states, but all of the states I’ve selected are not available to you. Also, you’ll have to describe a reason why you selected your group of states, which I’ll skip here.

I’ve selected five variables (v147, v145, v021, v023 and v139) which I’ll describe shortly. You will make your own choices, of course, and you’ll need to provide more information on each variable in a codebook.

To help you think about the chunk of code below, note that the code below does the following things:

  1. Filter the data to the actual counties that are ranked in the Rankings (this eliminates state and USA totals, mainly.)
  2. Filter to the states we’ve selected (the %in% command lets us include any state that is in the list we then create with the c() function).
  3. Select the variables that we’re going to use in our study, including the three mandatory variables (fipscode, state and county).
  4. Rename the five variables we’ve selected with more meaningful names. These names are motivated by the actual meaning of the variables, as shown in the top row (that we deleted) in the original csv, the PDF files I’ve included for you, and the more detailed variable descriptions on the County Health Ranking site.
chr_2019 <- chr_2019_raw %>%
    filter(county_ranked == 1) %>%
    filter(state %in% c("DC", "DE", "CT", "HI", "NH", "RI")) %>%
    select(fipscode, state, county, 
           v147_rawvalue, v145_rawvalue, v021_rawvalue, 
           v023_rawvalue, v139_rawvalue) %>%
    rename(life_expectancy = v147_rawvalue,
           freq_mental_distress = v145_rawvalue,
           hsgraduation = v021_rawvalue,
           unemployment = v023_rawvalue,
           food_insecurity = v139_rawvalue)

2.2 Repairing the fipscode and factoring the state

The fipscode is just a numerical code, and not a meaningful number (so that, for instance, calculating the mean of fipscode would make no sense.) To avoid confusion later, it’s worth it to tell R to treat fipscode as a character variable, rather than a double-precision numeric variable.

But there’s a problem with doing this, as R has already missed the need to pull in some leading zeros (the FIPS code is a 5-digit number which identifies a state (with the first two digits) and then a county (with the remaining three digits) but by reading the fipscode in as a numeric variable, some of the values you wind up with will be from states that need an opening zero in order to get to five digits total.)

We can fix this by applying a function from the stringr package (part of the tidyverse,) which will both add a “zero” to any fips code with less than 5 digits, but will also turn fipscode into a character variable, which is a better choice for a numeric code.

It will also be helpful later to include state as a factor variable, rather than a character.

We can accomplish these two tasks with the following chunk of code.

chr_2019 <- chr_2019 %>%
    mutate(fipscode = str_pad(fipscode, 5, pad = "0"),
           state = factor(state))

You can certainly use as.factor instead of factor here if you like. If you wish to arrange the levels of your states factor in an order other than alphabetically by postal abbreviation (perhaps putting Ohio first or something), then you could do so with fct_recode(), but I won’t do that here.

2.2.1 Checking Initial Work

Given the “states” I selected, I should have 31 rows, since there are 31 counties across those states, and I should have 8 variables. It’s also helpful to glimpse through the data and be sure nothing strange has happened in terms of what the first few values look like. Note the leading zeros in fipscode (and that it’s now a character variable) and that state is now a factor, as we’d hoped.

glimpse(chr_2019)
Rows: 31
Columns: 8
$ fipscode             <chr> "09001", "09003", "09005", "09007", "09009", "...
$ state                <fct> CT, CT, CT, CT, CT, CT, CT, CT, DE, DE, DE, DC...
$ county               <chr> "Fairfield County", "Hartford County", "Litchf...
$ life_expectancy      <dbl> 82.57683, 80.31690, 80.69485, 81.06904, 79.981...
$ freq_mental_distress <dbl> 0.09663473, 0.10244227, 0.09852291, 0.09609184...
$ hsgraduation         <dbl> 0.8942186, 0.8551798, 0.9043855, 0.9460923, 0....
$ unemployment         <dbl> 0.04506580, 0.04842678, 0.04308845, 0.04052900...
$ food_insecurity      <dbl> 0.096, 0.119, 0.095, 0.099, 0.124, 0.115, 0.09...

Looks good. I can check to see that each of my states has the anticipated number of counties, too.

chr_2019 %>% tabyl(state) %>% adorn_pct_formatting() 
 state  n percent
    CT  8   25.8%
    DC  1    3.2%
    DE  3    9.7%
    HI  4   12.9%
    NH 10   32.3%
    RI  5   16.1%

OK. These results match up with what I was expecting.

2.3 Creating Binary Categorical Variables

First, I’m going to make a binary categorical variable using the unemployment variable. Note that categorizing a quantitative variable like this is (in practice) a terrible idea, but we’re doing it here so that you can demonstrate some facility with modeling using a categorical variable.

We have numerous options for creating a binary variable.

2.3.1 Splitting into two categories based on the median

chr_2019 <- chr_2019 %>%
    mutate(temp1_ms = case_when(
                   unemployment < median(unemployment) ~ "low",
                   TRUE ~ "high"),
           temp1_ms = factor(temp1_ms))

mosaic::favstats(unemployment ~ temp1_ms, data = chr_2019) %>% 
    kable(digits = 3)
temp1_ms min Q1 median Q3 max mean sd n missing
high 0.039 0.041 0.045 0.049 0.061 0.046 0.005 16 0
low 0.022 0.024 0.026 0.028 0.038 0.027 0.005 15 0

2.3.2 Splitting into two categories based on a specific value

chr_2019 <- chr_2019 %>%
    mutate(temp2_4pct = case_when(
                   unemployment < 0.04 ~ "below4percent",
                   TRUE ~ "above4percent"),
           temp2_4pct = factor(temp2_4pct))

mosaic::favstats(unemployment ~ temp2_4pct, data = chr_2019) %>% 
    kable(digits = 3)
temp2_4pct min Q1 median Q3 max mean sd n missing
above4percent 0.040 0.042 0.045 0.049 0.061 0.046 0.005 15 0
below4percent 0.022 0.024 0.026 0.028 0.039 0.027 0.005 16 0

2.3.3 Using cut2 from Hmisc to split into two categories as evenly as possible

chr_2019 <- chr_2019 %>%
    mutate(temp3_cut2 = factor(Hmisc::cut2(unemployment, g = 2)))

mosaic::favstats(unemployment ~ temp3_cut2, data = chr_2019) %>% 
    kable(digits = 3)
temp3_cut2 min Q1 median Q3 max mean sd n missing
[0.0218,0.0400) 0.022 0.024 0.026 0.028 0.039 0.027 0.005 16 0
[0.0400,0.0605] 0.040 0.042 0.045 0.049 0.061 0.046 0.005 15 0

This approach is nice in one way, because it specifies the groups with a mathematical interval, but those factor level names can be rather unwieldy in practice. I might tweak them:

chr_2019 <- chr_2019 %>%
    mutate(temp3_cut2 = factor(Hmisc::cut2(unemployment, g = 2)),
           temp4_newnames = fct_recode(temp3_cut2,
                                         lessthan4 = "[0.0218,0.0400)",
                                         higher = "[0.0400,0.0605]"))

mosaic::favstats(unemployment ~ temp4_newnames, data = chr_2019) %>% 
    kable(digits = 3)
temp4_newnames min Q1 median Q3 max mean sd n missing
lessthan4 0.022 0.024 0.026 0.028 0.039 0.027 0.005 16 0
higher 0.040 0.042 0.045 0.049 0.061 0.046 0.005 15 0

2.3.4 Cleaning up

So, I’ve created four different variables here, when I only need the one. I’ll go with the median split approach, (which I’ll rename unemp_cat) and then drop the other attempts I created from my tibble in this next bit of code. Notice the use of the minus sign (-) before the list of variables I’m dropping in the select statement.

chr_2019 <- chr_2019 %>%
    rename(unemp_cat = temp1_ms) %>%
    select(-c(temp2_4pct, temp3_cut2, temp4_newnames))

Let’s check - we should still have 31 rows, but now we should have 9 columns (variables), since we’ve added the unemp_cat column to the data.

names(chr_2019)
[1] "fipscode"             "state"                "county"              
[4] "life_expectancy"      "freq_mental_distress" "hsgraduation"        
[7] "unemployment"         "food_insecurity"      "unemp_cat"           
nrow(chr_2019)
[1] 31

OK. Still looks fine.

2.4 Creating Multi-Category Variables

Now, I’m going to demonstrate the creation of a multi-category variable based on the hsgraduation variable. I’ll briefly reiterate that categorizing a quantitative variable like this is (in practice) a terrible, no good, very bad idea, but we’re doing it anyway for pedagogical rather than scientific reasons.

2.4.1 Creating a Three-Category Variable

Suppose we want to create three groups of equal size (which, since we have only 31 observations and need to have at least 10 in each group, is really our only choice in my example) and want to use the cut2 function from the Hmisc package.

chr_2019 <- chr_2019 %>%
    mutate(temp3 = factor(Hmisc::cut2(hsgraduation, g = 3)))

mosaic::favstats(hsgraduation ~ temp3, data = chr_2019) %>% 
    kable(digits = 3)
temp3 min Q1 median Q3 max mean sd n missing
[0.724,0.880) 0.724 0.812 0.840 0.852 0.877 0.826 0.042 11 0
[0.880,0.909) 0.880 0.882 0.888 0.894 0.904 0.889 0.009 10 0
[0.909,0.946] 0.909 0.923 0.930 0.936 0.946 0.930 0.011 10 0
chr_2019 <- chr_2019 %>%
    mutate(hsgrad_cat = fct_recode(temp3,
                                   bottom = "[0.724,0.880)",
                                   middle = "[0.880,0.909)",
                                   top = "[0.909,0.946]"))

mosaic::favstats(hsgraduation ~ hsgrad_cat, data = chr_2019) %>% 
    kable(digits = 3)
hsgrad_cat min Q1 median Q3 max mean sd n missing
bottom 0.724 0.812 0.840 0.852 0.877 0.826 0.042 11 0
middle 0.880 0.882 0.888 0.894 0.904 0.889 0.009 10 0
top 0.909 0.923 0.930 0.936 0.946 0.930 0.011 10 0
  1. Note that this same approach (changing g to 4 or 5 as appropriate) could be used to create a 4-category or 5-category variable.
  2. Note also that I used (bottom, middle, top) as the names of my categories instead of, for instance, (low, middle, high).
    • I did this so that R’s default factor sorting (which is alphabetical) would still give me a reasonable order. Otherwise, I’d need to add a fct_relevel step to sort the categories by hand in some reasonable way.
    • Another good trick might have been to precede names that wouldn’t be in the order I want them alphabetically with a number so they sort in a sensible order, perhaps with (1_high, 2_med, 3_low.)

2.4.2 Creating a 5-Category variable with Specified Cutpoints

Suppose we want to split our hsgraduation data so that we have five categories, based on the cutpoints (0.8, 0.85, 0.9 and 0.92). These four cutpoints will produce five mutually exclusive (no county can be in more than one category) and collectively exhaustive (every county is assigned to a category) categories:

  1. hsgraduation rate below 0.80,
  2. 0.80 up to but not including 0.85,
  3. 0.85 up to but not including 0.90,
  4. 0.90 up to but not including 0.92, and
  5. hsgraduation rate of 0.92 or more
chr_2019 <- chr_2019 %>%
    mutate(temp4 = case_when(
        hsgraduation < 0.8 ~ "1_lowest",
        hsgraduation < 0.85 ~ "2_low",
        hsgraduation < 0.9 ~ "3_middle",
        hsgraduation < 0.92 ~ "4_high",
        TRUE ~ "5_highest"),
        temp4 = factor(temp4))

mosaic::favstats(hsgraduation ~ temp4, data = chr_2019) %>% 
    kable(digits = 3)
temp4 min Q1 median Q3 max mean sd n missing
1_lowest 0.724 0.740 0.756 0.773 0.789 0.756 0.046 2 0
2_low 0.801 0.826 0.837 0.842 0.848 0.831 0.017 6 0
3_middle 0.855 0.878 0.882 0.888 0.894 0.880 0.013 11 0
4_high 0.901 0.903 0.907 0.912 0.920 0.909 0.008 4 0
5_highest 0.923 0.928 0.931 0.939 0.946 0.933 0.009 8 0

I’ll just note that it is also possible to set cutpoints with Hmisc::cut2.

2.4.3 Cleaning up

So, I’ve created two multi-categorical variables, but I will just retain the 3-category version (which I called hsgrad_cat) and drop the other temporary efforts.

chr_2019 <- chr_2019 %>%
    select(-c(temp3, temp4))

2.5 Structure of My Tibble

Next, I’ll print the structure of my tibble. I’m checking to see that:

  • the initial row tells me that this is a tibble and specifies its dimensions
  • I still have the complete set of 31 rows (counties)
  • I’ve included only 10 variables:
    • the three required variables fipscode, county and state, where I’ll also check that fipscode and county should be character () variables, and state should be a factor variable (), with an appropriate number of levels
    • my original five selected variables, properly renamed and all of numerical () type (this may also be specified as double-precision or , which is fine)
    • my two categorical variables unemp_cat and hsgrad_cat which should each be factors with appropriate levels specified, followed by numerical codes
str(chr_2019)
tibble [31 x 10] (S3: tbl_df/tbl/data.frame)
 $ fipscode            : chr [1:31] "09001" "09003" "09005" "09007" ...
 $ state               : Factor w/ 6 levels "CT","DC","DE",..: 1 1 1 1 1 1 1 1 3 3 ...
 $ county              : chr [1:31] "Fairfield County" "Hartford County" "Litchfield County" "Middlesex County" ...
 $ life_expectancy     : num [1:31] 82.6 80.3 80.7 81.1 80 ...
 $ freq_mental_distress: num [1:31] 0.0966 0.1024 0.0985 0.0961 0.1109 ...
 $ hsgraduation        : num [1:31] 0.894 0.855 0.904 0.946 0.834 ...
 $ unemployment        : num [1:31] 0.0451 0.0484 0.0431 0.0405 0.0503 ...
 $ food_insecurity     : num [1:31] 0.096 0.119 0.095 0.099 0.124 0.115 0.099 0.111 0.13 0.115 ...
 $ unemp_cat           : Factor w/ 2 levels "high","low": 1 1 1 1 1 1 1 1 1 1 ...
 $ hsgrad_cat          : Factor w/ 3 levels "bottom","middle",..: 2 1 2 3 1 2 3 1 2 1 ...

Looks good so far. I think we are ready to go.

3 Codebook

This is a table listing all 10 variables that are included in your tibble, and providing some important information about them, mostly drawn from the County Health Ranking web site. For each of your five selected variables, be sure to include the original code (vXXX) from the raw file.

Variable Description
fipscode FIPS code
state State: my six states are CT, DC, DE, HI, NH, RI
county County Name
life_expectancy (v147) Life Expectancy, which will be my outcome
freq_mental_distress (v145) Frequent Mental Distress Rate
hsgraduation (v021) High School Graduation Rate
unemployment (v023) Unemployment Rate
food_insecurity (v139) Food Insecurity Rate
unemp_cat 2 levels: low = unemployment below 3.9%, or high
hsgrad_cat 3 levels: bottom = hsgraduation below 88%, middle or top = 90.9% or above

Note that I’ve provided details on the definition of our categorical variables.

More details on two of our original five variables are specified below. These results are rephrased versions of the summaries linked on the County Health Rankings site. You’ll need to provide information of this type as part of the codebook for all five of your selected variables.

  • lifeexpectancy was originally variable v147_rawvalue, and is listed in the Length of Life subcategory under Health Outcomes at County Health Rankings. It describes the average number of years a person residing in the county can expect to live, according to the current mortality experience (age-specific death rates) of the county’s population. It is based on data from the National Center for Health Statistics Mortality Files from 2016-18. This will be my outcome variable.

  • hsgraduation was originally variable v021_rawvalue, and is listed in the Education subcategory under Social & Economic Factors at County Health Rankings. It describes the proportion of the county’s ninth grade cohort that graduates with a high school diploma in four years, and is based on EDFacts data from 2016-17. Comparisons across state lines are not recommended because of differences in how states define the data, according to County Health Rankings.

3.1 Proposal Requirement 1

Remember that you will need to do five things in the proposal.

  1. a sentence or two (perhaps accompanied by a small table of R results) specifying the 4-6 states you chose, and the number of counties you are studying in total and within each state. In an additional sentence or two, provide some motivation for why you chose those states.

3.2 Proposal Requirement 2

  1. A list of the five variables (including their original raw names and your renamed versions) you are studying, with a clear indication of the cutpoints you chose to create the binary categories out of variable 4 and the multiple categories out of variable 5. Think of this as an early version of what will eventually become your codebook. For each variable, provide a sentence describing your motivation for why this variable was interesting to you, and also please specify which of your quantitative variables will serve as your outcome.

3.3 Proposal Requirement 3

Print the tibble, so we can verify that it is, in fact, a tibble, that prints the first 10 rows.

chr_2019
# A tibble: 31 x 10
   fipscode state county life_expectancy freq_mental_dis~ hsgraduation
   <chr>    <fct> <chr>            <dbl>            <dbl>        <dbl>
 1 09001    CT    Fairf~            82.6           0.0966        0.894
 2 09003    CT    Hartf~            80.3           0.102         0.855
 3 09005    CT    Litch~            80.7           0.0985        0.904
 4 09007    CT    Middl~            81.1           0.0961        0.946
 5 09009    CT    New H~            80.0           0.111         0.834
 6 09011    CT    New L~            79.9           0.109         0.890
 7 09013    CT    Tolla~            81.8           0.100         0.923
 8 09015    CT    Windh~            78.8           0.114         0.848
 9 10001    DE    Kent ~            77.8           0.117         0.901
10 10003    DE    New C~            78.7           0.110         0.877
# ... with 21 more rows, and 4 more variables: unemployment <dbl>,
#   food_insecurity <dbl>, unemp_cat <fct>, hsgrad_cat <fct>

3.4 Proposal Requirement 4

To meet proposal requirement 4, run describe from the Hmisc package.

Hmisc::describe(chr_2019)
chr_2019 

 10  Variables      31  Observations
--------------------------------------------------------------------------------
fipscode 
       n  missing distinct 
      31        0       31 

lowest : 09001 09003 09005 09007 09009, highest: 44001 44003 44005 44007 44009
--------------------------------------------------------------------------------
state 
       n  missing distinct 
      31        0        6 

lowest : CT DC DE HI NH, highest: DC DE HI NH RI
                                              
Value         CT    DC    DE    HI    NH    RI
Frequency      8     1     3     4    10     5
Proportion 0.258 0.032 0.097 0.129 0.323 0.161
--------------------------------------------------------------------------------
county 
       n  missing distinct 
      31        0       30 

lowest : Belknap County    Bristol County    Carroll County    Cheshire County   Coos County      
highest: Sullivan County   Sussex County     Tolland County    Washington County Windham County   
--------------------------------------------------------------------------------
life_expectancy 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
      31        0       31        1    79.99    1.719    78.04    78.34 
     .25      .50      .75      .90      .95 
   78.98    79.92    81.02    81.99    82.56 

lowest : 76.76078 77.80577 78.27487 78.34321 78.41546
highest: 81.80405 81.99087 82.53515 82.57683 82.66931
--------------------------------------------------------------------------------
freq_mental_distress 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
      31        0       31        1   0.1102  0.01115  0.09484  0.09663 
     .25      .50      .75      .90      .95 
 0.10358  0.11089  0.11691  0.12052  0.12094 

lowest : 0.08533063 0.09358767 0.09609184 0.09663473 0.09852291
highest: 0.11844904 0.12052429 0.12054963 0.12133815 0.13272832
--------------------------------------------------------------------------------
hsgraduation 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
      31        0       31        1   0.8799  0.05573   0.7950   0.8235 
     .25      .50      .75      .90      .95 
  0.8517   0.8854   0.9214   0.9319   0.9413 

lowest : 0.7236731 0.7892854 0.8007369 0.8235494 0.8338990
highest: 0.9302662 0.9318966 0.9370460 0.9455388 0.9460923
--------------------------------------------------------------------------------
unemployment 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
      31        0       31        1  0.03647  0.01256  0.02219  0.02353 
     .25      .50      .75      .90      .95 
 0.02608  0.03922  0.04503  0.04974  0.05024 

lowest : 0.02181668 0.02198898 0.02239195 0.02353413 0.02358351
highest: 0.04842678 0.04973634 0.05018781 0.05028742 0.06051724
--------------------------------------------------------------------------------
food_insecurity 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
      31        0       21    0.994   0.1058  0.01585   0.0895   0.0900 
     .25      .50      .75      .90      .95 
  0.0975   0.1040   0.1150   0.1240   0.1290 

lowest : 0.074 0.089 0.090 0.093 0.095, highest: 0.119 0.124 0.128 0.130 0.132
--------------------------------------------------------------------------------
unemp_cat 
       n  missing distinct 
      31        0        2 
                      
Value       high   low
Frequency     16    15
Proportion 0.516 0.484
--------------------------------------------------------------------------------
hsgrad_cat 
       n  missing distinct 
      31        0        3 
                               
Value      bottom middle    top
Frequency      11     10     10
Proportion  0.355  0.323  0.323
--------------------------------------------------------------------------------

3.5 Three Important Checks

There are three important things I have to demonstrate, as described in Tasks C (Identify Your Variables) and D (Create Categorical Variables) in our Data Development work. They are:

  • Each of the five variables you select must have data for at least 75% of the counties in each state you plan to study.

Do we have any missing data here?

chr_2019 %>% 
    summarize(across(life_expectancy:food_insecurity, ~ sum(is.na(.))))
# A tibble: 1 x 5
  life_expectancy freq_mental_distress hsgraduation unemployment food_insecurity
            <int>                <int>        <int>        <int>           <int>
1               0                    0            0            0               0

Nope, so we’re OK!

If I did have some missingness, then I would probably want to summarize this by state, so that I could compare the results. Here’s a way to look at this just for the life_expectancy variable.

mosaic::favstats(life_expectancy ~ state, data = chr_2019) %>%
    select(state, n, missing) %>%
    mutate(pct_available = 100*(n - missing)/n) %>%
    kable()
state n missing pct_available
CT 8 0 100
DC 1 0 100
DE 3 0 100
HI 4 0 100
NH 10 0 100
RI 5 0 100

We’re OK, because 100% of the data are available. In my example, this is true for all five of the variables I used. In yours, that may or may not be the case. Remember that all of your selected variables need to be available in at least 75% of the counties in EACH state you study.

  • The raw versions of each of your five selected variables must have at least 10 distinct non-missing values.
chr_2019 %>% 
    summarize(across(life_expectancy:food_insecurity, ~ n_distinct(.)))
# A tibble: 1 x 5
  life_expectancy freq_mental_distress hsgraduation unemployment food_insecurity
            <int>                <int>        <int>        <int>           <int>
1              31                   31           31           31              21

OK. We’re fine there.

  • For each of the categorical variables you create, every level of the resulting factor must include at least 10 counties.
chr_2019 %>% tabyl(unemp_cat)
 unemp_cat  n  percent
      high 16 0.516129
       low 15 0.483871
chr_2019 %>% tabyl(hsgrad_cat)
 hsgrad_cat  n   percent
     bottom 11 0.3548387
     middle 10 0.3225806
        top 10 0.3225806

OK. I have at least 10 counties in each category for each of the categorical variables that I created.

3.6 Saving the Tibble

Finally, we’ll save this tibble as an R data set into the same location as our original data set within our R Project directory.

saveRDS(chr_2019, file = "chr_2019_Thomas_Love.Rds")

You’ll want to substitute in your own name, of course.

3.7 Proposal Requirement 5

Having done all of this work, the set of Proposal Requirements (repeated below) should be straightforward. We’ve already dealt with the first four. The fifth is repeated below.

  1. In a paragraph, describe the most challenging (or difficult) part of completing the work so far, and how you were able to overcome whatever it was that was difficult.

OK. That’s your job.

4 Analysis 1

4.1 The Variables

In this section you will build a simple linear regression model to predict your outcome using one of your two quantitative predictors (it’s your choice.) Start by identifying those variables, and restricting your data set for this analysis to the complete cases on those variables.

  • Use complete English sentences to identify your outcome and your predictor, describing what each variable means and its units of measurement.
  • Also specify the name of each variable in your tibble (making it clear which is the outcome and which the predictor), the name of your tibble, which states you’re studying, and how many counties have complete data on both variables.
  • Finally, specify the values of your outcome and predictor for Cuyahoga County, in Ohio, where CWRU’s campus is located.

4.2 Research Question

Here, you should state your research question clearly. A good research question (and your response to it) is the most important part of this analysis.

  • Do not use more than one research question for this Analysis.
  • A research question will end with a question mark, and will be something you will be able to answer (or at least respond to effectively) after your analysis is complete.

4.2.1 Guidance on Research Questions from Dr. Love

  • Examples of dull but moderately effective and minimally appropriate research questions in this setting would be:

How well does a linear model using [predictor] predict [outcome], in [number] counties in the states of [list of your states]?

or

What is the nature of the association between [predictor] and [outcome], in [number] counties in the states of [list of your states]?

  • You should be able to do meaningfully better than that, especially if you have a reason to believe something in advance about the direction or strength of the association you are anticipating.
  • However, if you’re struggling, using that format will be OK.
  • A research question uses formal but clear language.
  • Given your planned analyses, stay away from statements about cause and effect, and don’t use the words correlate or regression (in any form) in your research question.

4.3 Visualizing the Data

Provide an attractive, well-labeled scatterplot of your outcome and predictor, before any transformations, including all code necessary to build it, and describe what the plot tells you about the association of the variables, as detailed in the instructions.

4.4 Transformation Assessment

Here, you will provide information about any transformations you chose to apply to the outcome, and explain why you did (or didn’t) choose to use a transformation.

  • If you decide to use a transformation, specify the transformation you’ve chosen carefully, and show the scatterplot of the transformed data, using the same suggestions as were provided in the previous section to describe the resulting plot. Write a few sentences describing the transformations you considered, and why you thought this was the most promising, and why you eventually decided to use the transformed versions of your variables.

  • If you decide not to use a transformation, identify the ONE plot which was the most promising of available transformations and show that one. Write a few sentences describing the transformations you considered, and why you thought this was the most promising, and why you eventually decided to stick with the original versions of the variables.

4.4.1 Guidance from Dr. Love on the Transformation Assessment

Your response will include ONE plot demonstrating a particular transformation, although you will probably fit and view several plots in order to select a final one for this work.

For this part of Project A, confine your search to either a logarithm, an inverse, or a square as applied to the outcome. If you want to consider one of those transformations for the predictor as well, that’s OK but not crucial. You should select the most promising transformation on the basis of a scatterplot (perhaps with a loess smooth and linear fit) after the transformation has been applied.

You are discouraged from using numerical summaries of fit in this section. Of course, summary statistics like \(R^2\) when the outcome is transformed will definitely not be comparable.

Note that this transformation assessment is part of Analysis 1, but will not be shown in Analyses 2 or 3. Feel free to use the transformation (of the outcome) that you select in Analysis 1 for the other two Analyses, if you like.

4.5 The Fitted Model

Fit your model to use your predictor to predict your outcome (applying your selected transformation) and provide the code you used, and the following summary elements in this section.

  1. A written statement of the full prediction equation, with coefficients nicely rounded, and a careful description of what the coefficients mean in context.

  2. A tidy summary of the model’s coefficients, including 90% confidence interval for model estimates.

  3. The model’s R-squared, residual standard error, the number of observations to which the model was fit, and the Pearson correlation of your predictor and outcome.

4.6 Residual Analysis

Here, you’ll need to do four things.

  1. prepare a pair of residual plots (one to assess residuals vs. fitted values for non-linearity, and one to assess Normality in the residuals or the standardized residuals.)
  2. interpret those plots in terms of what they tell you about how well the assumptions of linearity and Normality hold for your setting, in complete English sentences.
  3. display your model’s prediction for the original (untransformed) outcome you are studying for Cuyahoga County, in Ohio, and compare it to Cuyahoga’s actual value of this outcome.
  4. identify the two counties (by name and state) where the model you’ve fit is least successful at predicting the outcome (in the sense of having the largest residual in absolute value.)
  • The augment function would be very helpful here.
  • If you model is called m1, you could use something like plot(m1, which = c(1:2)) to obtain these two plots and that’s OK.
  • You can produce a more pleasing picture using ggplot2 and patchwork following the strategy I’ve demonstrated multiple times in the slides and the Course Notes, should you desire.

4.7 Conclusions and Limitations

Here, you’ll write two paragraphs.

In the first paragraph, you should provide a clear restatement of your research question, followed by a clear and appropriate response to your research question, motivated by your results.

Then, write a paragraph which summarizes the key limitations of your work in Analysis 1.

  • If you see problems with regression assumptions in your residual plot, that would be a good thing to talk about here, for instance.
  • Another issue that may be worth discussing is your target population, and what evidence you can describe that might indicate whether your selected states are a representative sample of the US as a whole, or perhaps some particular part of the United States.

5 Analysis 2

5.1 The Variables

In this section you will build a simple linear regression model to predict your outcome using either of your two categorical predictors (it’s your choice.) Start by identifying those variables, and restricting your data set for this analysis to the complete cases on those variables.

  • Use complete English sentences to identify your outcome and your predictor, describing what each variable means and the available categories for the predictor. Be sure the predictor is represented in R as a factor, with an appropriate ordering.
  • Again, specify the name of each variable in your tibble (making it clear which is the outcome and which the predictor), the name of your tibble, which states you’re studying, and how many counties have complete data on both variables.
  • Finally, specify the values of your outcome and predictor for Cuyahoga County, in Ohio, where CWRU’s campus is located.

5.2 Research Question

Here, you should state your research question clearly. A good research question (and your response to it) is the most important part of this analysis.

  • Do not use more than one research question for this Analysis.
  • A research question will end with a question mark, and will be something you will be able to answer (or at least respond to effectively) after your analysis is complete.

5.2.1 Guidance on Research Questions from Dr. Love

  • Examples of dull but moderately effective and minimally appropriate research questions for analysis 2 would be:

Does [category A] or [category B] show higher levels of [outcome], in [number] counties in the states of [list of your states]? (for a binary predictor)

or

Which of [the categories in your predictor] is associated with the highest mean level of [outcome], in [number] counties in the states of [list of your states]? (for a multi-categorical predictor)

  • Otherwise, use the same guidance about research questions I provided for analysis 1.

5.3 Visualizing the Data

In Analysis 2, it is up to decide whether or not a transformation of the outcome would be valuable. Your model will assume that the distribution of the outcome is Normal, with similar variance across each level of your categorical predictor. If you decide to transform the outcome, again, stick with either the logarithm, inverse or square.

Provide an attractive boxplot (with or without a violin plot) showing your outcome broken down by levels of your predictor, after whatever transformation you choose (if any), including all code necessary to build it, and describe what the plot tells you about the association of the variables, as detailed in the instructions.

5.4 The Fitted Model

Fit your model to use your predictor to predict your outcome (applying your selected transformation) and provide the code you used, and the following summary elements in this section.

  1. A written statement of the full prediction equation, with coefficients nicely rounded, and a careful description of what the coefficients mean in context.

  2. A tidy summary of the model’s coefficients, including 90% confidence interval for model estimates.

  3. The model’s R-squared, residual standard error, and the number of observations to which the model was fit.

5.5 Prediction Analysis

Here, you’ll need to do three things.

  1. plot the residuals against the categorical predictor in a useful way to help you assess Normality of the residuals within each category.
  2. display your model’s prediction for the original (untransformed) outcome you are studying for Cuyahoga County, in Ohio, and compare it to Cuyahoga’s actual value of this outcome.
  3. identify the two counties (by name and state) where the model you’ve fit is least successful at predicting the outcome (in the sense of having the largest residual in absolute value.)

Again, augment would be very helpful here.

5.6 Conclusions and Limitations

Your first paragraph in this response should provide a clear restatement of your research question, and then a clear and appropriate response to your research question, motivated by your results. This should include a statement comparing the categories on the mean of the outcome you modeled.

Then, in your second and final paragraph in this section, provide a brief description of the limitations of this Analysis, Be specific about your concerns.

6 Analysis 3

In this section you will build a linear regression model to predict your outcome using one of your two quantitative predictors (it’s your choice) and the state (which should be a factor in your model with Ohio as the baseline category.)

6.1 The Variables

  • Do everything in this section that I recommended for the Variables section in Analysis 1.

6.2 Research Question

Here, you should state your research question clearly. A good research question (and your response to it) is the most important part of this analysis.

  • Do not use more than one research question for this Analysis.
  • A research question will end with a question mark, and will be something you will be able to answer (or at least respond to effectively) after your analysis is complete.

6.2.1 Guidance on Research Questions from Dr. Love

A dull but moderately effective and minimally appropriate research questions in this setting would be:

How well does [predictor] predict [outcome] after accounting for differences between states?

  • Again, follow the guidance from Analysis 1.

6.3 Visualizing the Data

Provide an attractive, well-labeled scatterplot of your outcome and your quantitative predictor, and incorporate some way of providing separate results by state. You can do this in several ways in R, including through the use of facets.

  • Plot the data incorporating the transformation for your outcome (if any) that you will use in your model for Analysis 3.
  • In your model for Analysis 3, use the same transformation for the outcome that you used in Analysis 1.
  • Show your code, and describe what the plot tells you about the association of the variables.

6.4 The Fitted Model

Fit your model to use your predictors to predict your outcome (applying your selected transformation) and provide the code you used, and the following summary elements in this section.

  1. A written statement of the full prediction equation, with coefficients nicely rounded, and a careful description of what the coefficients mean for the intercept, the quantitative predictor and one of the states, in context.

  2. A tidy summary of the model’s coefficients, including 90% confidence interval for model estimates. Be sure that Ohio is used as the baseline state.

  3. The model’s R-squared, residual standard error, the number of observations to which the model was fit.

6.5 Residual Analysis

Do what you did in Analysis 1.

6.6 Conclusion and Limitations

Your first paragraph in this response should provide a clear restatement of your research question, and then a clear and appropriate response to your research question, motivated by your results.

Then, write a paragraph to summarize the key limitations of your work, as you did in the previous analyses.

7 Session Information

sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] ggridges_0.5.2    mosaicData_0.20.1 ggformula_0.9.4   ggstance_0.3.4   
 [5] Matrix_1.2-18     lattice_0.20-41   forcats_0.5.0     stringr_1.4.0    
 [9] dplyr_1.0.2       purrr_0.3.4       readr_1.3.1       tidyr_1.1.2      
[13] tibble_3.0.3      ggplot2_3.3.2     tidyverse_1.3.0   magrittr_1.5     
[17] janitor_2.0.1     rmdformats_0.3.7  knitr_1.30       

loaded via a namespace (and not attached):
 [1] fs_1.5.0            lubridate_1.7.9     RColorBrewer_1.1-2 
 [4] httr_1.4.2          tools_4.0.2         backports_1.1.10   
 [7] utf8_1.1.4          R6_2.4.1            rpart_4.1-15       
[10] Hmisc_4.4-1         DBI_1.1.0           colorspace_1.4-1   
[13] nnet_7.3-14         withr_2.3.0         tidyselect_1.1.0   
[16] gridExtra_2.3       leaflet_2.0.3       curl_4.3           
[19] compiler_4.0.2      cli_2.0.2           rvest_0.3.6        
[22] htmlTable_2.1.0     xml2_1.3.2          ggdendro_0.1.22    
[25] bookdown_0.20       checkmate_2.0.0     mosaicCore_0.8.0   
[28] scales_1.1.1        digest_0.6.25       foreign_0.8-80     
[31] rmarkdown_2.4       jpeg_0.1-8.1        base64enc_0.1-3    
[34] pkgconfig_2.0.3     htmltools_0.5.0     dbplyr_1.4.4       
[37] highr_0.8           htmlwidgets_1.5.1   rlang_0.4.7        
[40] readxl_1.3.1        rstudioapi_0.11     farver_2.0.3       
[43] generics_0.0.2      jsonlite_1.7.1      crosstalk_1.1.0.1  
[46] Formula_1.2-3       Rcpp_1.0.5          munsell_0.5.0      
[49] fansi_0.4.1         lifecycle_0.2.0     stringi_1.5.3      
[52] yaml_2.2.1          snakecase_0.11.0    MASS_7.3-53        
[55] plyr_1.8.6          grid_4.0.2          blob_1.2.1         
[58] ggrepel_0.8.2       crayon_1.3.4        haven_2.3.1        
[61] splines_4.0.2       hms_0.5.3           pillar_1.4.6       
[64] reprex_0.3.0        glue_1.4.2          evaluate_0.14      
[67] latticeExtra_0.6-29 data.table_1.13.0   modelr_0.1.8       
[70] png_0.1-7           vctrs_0.3.4         tweenr_1.0.1       
[73] cellranger_1.1.0    gtable_0.3.0        polyclip_1.10-0    
 [ reached getOption("max.print") -- omitted 8 entries ]