YOUR PROJECT A TITLE
0.1 Note to Students
Dr. Love updated this document on 2020-09-27 to add some good ideas from an initial set of proposal reviews.
This proposal includes a lot of description from Dr. Love about what he’s doing and what’s happening in the R code chunks; that description should not be included in your proposal (as an example, this whole section shouldn’t be in your proposal.) The document also doesn’t include several things that you will need to include in your proposal.
Think of this document as an annotated starting point for thinking about developing your proposal, rather than as a rigid template that just requires you to fill in a few gaps. There is still a lot of work for you to do. Your job in building your proposal requires you to (at a minimum):
- adapt the code provided here to address your own decisions and requirements (more than just filling in your title and name, although that’s an important thing to do.)
- edit what is provided here so that you wind up only including things that are appropriate for your project
- write your own descriptions of the states/measures you’re using and the results you obtain (which Dr. Love has mostly left out of this document.)
- knit the R Markdown document into an HTML or PDF report, and then proofread and spell-check all of your work before you submit it.
You should be certain you have a real title and author list in this file.
1 Preliminaries
1.1 My R Packages
Note that I have loaded the tidyverse last, and that I have not loaded any of the tidyverse packages individually. We’ll be checking to see that you’ve done this properly. These are the three packages that Dr. Love has used in preparing this proposal; they don’t include packages (like patchwork and broom, for instance) that he will almost certainly need in his analyses later. The final project itself should include all packages that get used.
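Here is a sketch of what that setup chunk might look like; I’m assuming knitr and janitor are the other two packages (based on the session information at the end of this document), with the tidyverse loaded last.

library(knitr)     # for kable()
library(janitor)   # for tabyl() and friends
library(tidyverse) # loaded last, on purpose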
1.2 Data Ingest
Note that Dr. Love is working here with 2019 data, rather than the 2020 data you’ll use. The guess_max setting ensures that read_csv will look through the entire data set (which has fewer than 4000 rows) when guessing column types, instead of just the first 1000 rows (which is the default.)
The code below actually loads in the data from County Health Rankings directly, using the 2019 period.
data_url <- "https://www.countyhealthrankings.org/sites/default/files/media/document/analytic_data2019.csv"
chr_2019_raw <- read_csv(data_url, skip = 1, guess_max = 4000)
Note that you’ll need a different data_url (listed below) for the 2020 data.
data_url <- "https://www.countyhealthrankings.org/sites/default/files/media/document/analytic_data2020_0.csv"
2 Data Development
2.1 Selecting My Data
I’ll be selecting data from the six “states” (Washington DC, Delaware, Connecticut, Hawaii, New Hampshire and Rhode Island) that are not available to you (because they each have only a few counties: in total there are just 31 counties in those six states.) Note that in your work, you will include Ohio and other states, but none of the states I’ve selected are available to you. Also, you’ll have to describe a reason why you selected your group of states, which I’ll skip here.
I’ve selected five variables (v147, v145, v021, v023 and v139) which I’ll describe shortly. You will make your own choices, of course, and you’ll need to provide more information on each variable in a codebook.
To help you think about the chunk of code below, note that it does the following things:
- Filter the data to the actual counties that are ranked in the Rankings (this eliminates state and USA totals, mainly.)
- Filter to the states we’ve selected (the %in% operator lets us include any state that is in the list we then create with the c() function).
- Select the variables that we’re going to use in our study, including the three mandatory variables (fipscode, state and county).
- Rename the five variables we’ve selected with more meaningful names. These names are motivated by the actual meaning of the variables, as shown in the top row (that we deleted) of the original csv, the PDF files I’ve included for you, and the more detailed variable descriptions on the County Health Rankings site.
chr_2019 <- chr_2019_raw %>%
filter(county_ranked == 1) %>%
filter(state %in% c("DC", "DE", "CT", "HI", "NH", "RI")) %>%
select(fipscode, state, county,
v147_rawvalue, v145_rawvalue, v021_rawvalue,
v023_rawvalue, v139_rawvalue) %>%
rename(life_expectancy = v147_rawvalue,
freq_mental_distress = v145_rawvalue,
hsgraduation = v021_rawvalue,
unemployment = v023_rawvalue,
food_insecurity = v139_rawvalue)
2.2 Repairing the fipscode and factoring the state
The fipscode is just a numerical code, not a meaningful number (so that, for instance, calculating the mean of fipscode would make no sense.) To avoid confusion later, it’s worth telling R to treat fipscode as a character variable, rather than a double-precision numeric variable.
But there’s a problem: because read_csv pulled fipscode in as a numeric variable, it dropped any leading zeros. The FIPS code is a five-digit number which identifies a state (with the first two digits) and then a county (with the remaining three digits), so counties in states whose codes begin with a zero wind up with fewer than five digits.
We can fix this by applying a function from the stringr package (part of the tidyverse) that will both add leading zeros to any fips code with fewer than five digits and turn fipscode into a character variable, which is a better choice for a numeric code.
It will also be helpful later to store state as a factor variable, rather than a character.
We can accomplish these two tasks with the following chunk of code.
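Here is a sketch of that chunk; I’m assuming str_pad() is the stringr function in question.

chr_2019 <- chr_2019 %>%
    # pad fipscode to five characters with leading zeros (this also
    # converts it to a character variable), and make state a factor
    mutate(fipscode = str_pad(fipscode, width = 5, pad = "0"),
           state = factor(state))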
You can certainly use as.factor instead of factor here if you like. If you wish to arrange the levels of your state factor in an order other than alphabetically by postal abbreviation (perhaps putting Ohio first or something), then you could do so with fct_relevel(), but I won’t do that here.
2.2.1 Checking Initial Work
Given the “states” I selected, I should have 31 rows, since there are 31 counties across those states, and I should have 8 variables. It’s also helpful to glimpse through the data and be sure nothing strange has happened in terms of what the first few values look like. Note the leading zeros in fipscode (and that it’s now a character variable) and that state is now a factor, as we’d hoped.
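The quick look below presumably comes from a call like this one:

glimpse(chr_2019)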
Rows: 31
Columns: 8
$ fipscode <chr> "09001", "09003", "09005", "09007", "09009", "...
$ state <fct> CT, CT, CT, CT, CT, CT, CT, CT, DE, DE, DE, DC...
$ county <chr> "Fairfield County", "Hartford County", "Litchf...
$ life_expectancy <dbl> 82.57683, 80.31690, 80.69485, 81.06904, 79.981...
$ freq_mental_distress <dbl> 0.09663473, 0.10244227, 0.09852291, 0.09609184...
$ hsgraduation <dbl> 0.8942186, 0.8551798, 0.9043855, 0.9460923, 0....
$ unemployment <dbl> 0.04506580, 0.04842678, 0.04308845, 0.04052900...
$ food_insecurity <dbl> 0.096, 0.119, 0.095, 0.099, 0.124, 0.115, 0.09...
Looks good. I can check to see that each of my states has the anticipated number of counties, too.
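The count that follows can be produced in several ways; here is one sketch, using tabyl() from the janitor package (my assumption about the tool used):

chr_2019 %>% tabyl(state) %>% adorn_pct_formatting()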
state n percent
CT 8 25.8%
DC 1 3.2%
DE 3 9.7%
HI 4 12.9%
NH 10 32.3%
RI 5 16.1%
OK. These results match up with what I was expecting.
2.3 Creating Binary Categorical Variables
First, I’m going to make a binary categorical variable using the unemployment variable. Note that categorizing a quantitative variable like this is (in practice) a terrible idea, but we’re doing it here so that you can demonstrate some facility with modeling using a categorical variable.
We have numerous options for creating a binary variable.
2.3.1 Splitting into two categories based on the median
chr_2019 <- chr_2019 %>%
mutate(temp1_ms = case_when(
unemployment < median(unemployment) ~ "low",
TRUE ~ "high"),
temp1_ms = factor(temp1_ms))
mosaic::favstats(unemployment ~ temp1_ms, data = chr_2019) %>%
kable(digits = 3)
temp1_ms | min | Q1 | median | Q3 | max | mean | sd | n | missing |
---|---|---|---|---|---|---|---|---|---|
high | 0.039 | 0.041 | 0.045 | 0.049 | 0.061 | 0.046 | 0.005 | 16 | 0 |
low | 0.022 | 0.024 | 0.026 | 0.028 | 0.038 | 0.027 | 0.005 | 15 | 0 |
2.3.2 Splitting into two categories based on a specific value
chr_2019 <- chr_2019 %>%
mutate(temp2_4pct = case_when(
unemployment < 0.04 ~ "below4percent",
TRUE ~ "above4percent"),
temp2_4pct = factor(temp2_4pct))
mosaic::favstats(unemployment ~ temp2_4pct, data = chr_2019) %>%
kable(digits = 3)
temp2_4pct | min | Q1 | median | Q3 | max | mean | sd | n | missing |
---|---|---|---|---|---|---|---|---|---|
above4percent | 0.040 | 0.042 | 0.045 | 0.049 | 0.061 | 0.046 | 0.005 | 15 | 0 |
below4percent | 0.022 | 0.024 | 0.026 | 0.028 | 0.039 | 0.027 | 0.005 | 16 | 0 |
2.3.3 Using cut2 from Hmisc to split into two categories as evenly as possible
chr_2019 <- chr_2019 %>%
mutate(temp3_cut2 = factor(Hmisc::cut2(unemployment, g = 2)))
mosaic::favstats(unemployment ~ temp3_cut2, data = chr_2019) %>%
kable(digits = 3)
temp3_cut2 | min | Q1 | median | Q3 | max | mean | sd | n | missing |
---|---|---|---|---|---|---|---|---|---|
[0.0218,0.0400) | 0.022 | 0.024 | 0.026 | 0.028 | 0.039 | 0.027 | 0.005 | 16 | 0 |
[0.0400,0.0605] | 0.040 | 0.042 | 0.045 | 0.049 | 0.061 | 0.046 | 0.005 | 15 | 0 |
This approach is nice in one way, because it specifies the groups with a mathematical interval, but those factor level names can be rather unwieldy in practice. I might tweak them:
chr_2019 <- chr_2019 %>%
mutate(temp3_cut2 = factor(Hmisc::cut2(unemployment, g = 2)),
temp4_newnames = fct_recode(temp3_cut2,
lessthan4 = "[0.0218,0.0400)",
higher = "[0.0400,0.0605]"))
mosaic::favstats(unemployment ~ temp4_newnames, data = chr_2019) %>%
kable(digits = 3)
temp4_newnames | min | Q1 | median | Q3 | max | mean | sd | n | missing |
---|---|---|---|---|---|---|---|---|---|
lessthan4 | 0.022 | 0.024 | 0.026 | 0.028 | 0.039 | 0.027 | 0.005 | 16 | 0 |
higher | 0.040 | 0.042 | 0.045 | 0.049 | 0.061 | 0.046 | 0.005 | 15 | 0 |
2.3.4 Cleaning up
So, I’ve created four different variables here, when I only need one. I’ll go with the median split approach (which I’ll rename unemp_cat), and then drop the other attempts I created from my tibble in this next bit of code. Notice the use of the minus sign (-) before the list of variables I’m dropping in the select statement.
chr_2019 <- chr_2019 %>%
rename(unemp_cat = temp1_ms) %>%
select(-c(temp2_4pct, temp3_cut2, temp4_newnames))
Let’s check: we should still have 31 rows, but now we should have 9 columns (variables), since we’ve added the unemp_cat column to the data.
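A check along those lines (the exact chunk here is my sketch) might be:

names(chr_2019) # should list 9 variables
nrow(chr_2019)  # should be 31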
[1] "fipscode" "state" "county"
[4] "life_expectancy" "freq_mental_distress" "hsgraduation"
[7] "unemployment" "food_insecurity" "unemp_cat"
[1] 31
OK. Still looks fine.
2.4 Creating Multi-Category Variables
Now, I’m going to demonstrate the creation of a multi-category variable based on the hsgraduation variable. I’ll briefly reiterate that categorizing a quantitative variable like this is (in practice) a terrible, no good, very bad idea, but we’re doing it anyway for pedagogical rather than scientific reasons.
2.4.1 Creating a Three-Category Variable
Suppose we want to create three groups of equal size (which, since we have only 31 observations and need to have at least 10 in each group, is really our only choice in my example) and want to use the cut2 function from the Hmisc package.
chr_2019 <- chr_2019 %>%
mutate(temp3 = factor(Hmisc::cut2(hsgraduation, g = 3)))
mosaic::favstats(hsgraduation ~ temp3, data = chr_2019) %>%
kable(digits = 3)
temp3 | min | Q1 | median | Q3 | max | mean | sd | n | missing |
---|---|---|---|---|---|---|---|---|---|
[0.724,0.880) | 0.724 | 0.812 | 0.840 | 0.852 | 0.877 | 0.826 | 0.042 | 11 | 0 |
[0.880,0.909) | 0.880 | 0.882 | 0.888 | 0.894 | 0.904 | 0.889 | 0.009 | 10 | 0 |
[0.909,0.946] | 0.909 | 0.923 | 0.930 | 0.936 | 0.946 | 0.930 | 0.011 | 10 | 0 |
chr_2019 <- chr_2019 %>%
mutate(hsgrad_cat = fct_recode(temp3,
bottom = "[0.724,0.880)",
middle = "[0.880,0.909)",
top = "[0.909,0.946]"))
mosaic::favstats(hsgraduation ~ hsgrad_cat, data = chr_2019) %>%
kable(digits = 3)
hsgrad_cat | min | Q1 | median | Q3 | max | mean | sd | n | missing |
---|---|---|---|---|---|---|---|---|---|
bottom | 0.724 | 0.812 | 0.840 | 0.852 | 0.877 | 0.826 | 0.042 | 11 | 0 |
middle | 0.880 | 0.882 | 0.888 | 0.894 | 0.904 | 0.889 | 0.009 | 10 | 0 |
top | 0.909 | 0.923 | 0.930 | 0.936 | 0.946 | 0.930 | 0.011 | 10 | 0 |
- Note that this same approach (changing g to 4 or 5 as appropriate) could be used to create a 4-category or 5-category variable.
- Note also that I used (bottom, middle, top) as the names of my categories instead of, for instance, (low, middle, high).
  - I did this so that R’s default factor sorting (which is alphabetical) would still give me a reasonable order. Otherwise, I’d need to add a fct_relevel step to sort the categories by hand in some reasonable way.
  - Another good trick might have been to precede names that wouldn’t otherwise sort alphabetically into the order I want with a number, so they sort in a sensible order, perhaps with (1_high, 2_med, 3_low.)
2.4.2 Creating a 5-Category variable with Specified Cutpoints
Suppose we want to split our hsgraduation data so that we have five categories, based on the cutpoints (0.80, 0.85, 0.90 and 0.92). These four cutpoints will produce five mutually exclusive (no county can be in more than one category) and collectively exhaustive (every county is assigned to a category) categories:
- hsgraduation rate below 0.80,
- 0.80 up to but not including 0.85,
- 0.85 up to but not including 0.90,
- 0.90 up to but not including 0.92, and
- hsgraduation rate of 0.92 or more.
chr_2019 <- chr_2019 %>%
mutate(temp4 = case_when(
hsgraduation < 0.8 ~ "1_lowest",
hsgraduation < 0.85 ~ "2_low",
hsgraduation < 0.9 ~ "3_middle",
hsgraduation < 0.92 ~ "4_high",
TRUE ~ "5_highest"),
temp4 = factor(temp4))
mosaic::favstats(hsgraduation ~ temp4, data = chr_2019) %>%
kable(digits = 3)
temp4 | min | Q1 | median | Q3 | max | mean | sd | n | missing |
---|---|---|---|---|---|---|---|---|---|
1_lowest | 0.724 | 0.740 | 0.756 | 0.773 | 0.789 | 0.756 | 0.046 | 2 | 0 |
2_low | 0.801 | 0.826 | 0.837 | 0.842 | 0.848 | 0.831 | 0.017 | 6 | 0 |
3_middle | 0.855 | 0.878 | 0.882 | 0.888 | 0.894 | 0.880 | 0.013 | 11 | 0 |
4_high | 0.901 | 0.903 | 0.907 | 0.912 | 0.920 | 0.909 | 0.008 | 4 | 0 |
5_highest | 0.923 | 0.928 | 0.931 | 0.939 | 0.946 | 0.933 | 0.009 | 8 | 0 |
I’ll just note that it is also possible to set specific cutpoints with Hmisc::cut2.
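For instance, here is a sketch that uses the cuts argument of cut2 instead of g (the temp5 name is just for illustration):

# sketch: specify the cutpoints directly, rather than asking for
# equal-sized groups, then count counties in each category
chr_2019 %>%
    mutate(temp5 = factor(Hmisc::cut2(hsgraduation,
                                      cuts = c(0.80, 0.85, 0.90, 0.92)))) %>%
    count(temp5)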
2.5 Structure of My Tibble
Next, I’ll print the structure of my tibble. I’m checking to see that:
- the initial row tells me that this is a tibble and specifies its dimensions
- I still have the complete set of 31 rows (counties)
- I’ve included only 10 variables:
  - the three required variables fipscode, county and state, where I’ll also check that fipscode and county are character (chr) variables, and state is a factor variable with an appropriate number of levels
  - my original five selected variables, properly renamed and all of numeric (num) type (this may also be specified as double-precision or dbl, which is fine)
  - my two categorical variables unemp_cat and hsgrad_cat, which should each be factors with appropriate levels specified, followed by numerical codes
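The structure below presumably comes from a call like:

str(chr_2019)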
tibble [31 x 10] (S3: tbl_df/tbl/data.frame)
$ fipscode : chr [1:31] "09001" "09003" "09005" "09007" ...
$ state : Factor w/ 6 levels "CT","DC","DE",..: 1 1 1 1 1 1 1 1 3 3 ...
$ county : chr [1:31] "Fairfield County" "Hartford County" "Litchfield County" "Middlesex County" ...
$ life_expectancy : num [1:31] 82.6 80.3 80.7 81.1 80 ...
$ freq_mental_distress: num [1:31] 0.0966 0.1024 0.0985 0.0961 0.1109 ...
$ hsgraduation : num [1:31] 0.894 0.855 0.904 0.946 0.834 ...
$ unemployment : num [1:31] 0.0451 0.0484 0.0431 0.0405 0.0503 ...
$ food_insecurity : num [1:31] 0.096 0.119 0.095 0.099 0.124 0.115 0.099 0.111 0.13 0.115 ...
$ unemp_cat : Factor w/ 2 levels "high","low": 1 1 1 1 1 1 1 1 1 1 ...
$ hsgrad_cat : Factor w/ 3 levels "bottom","middle",..: 2 1 2 3 1 2 3 1 2 1 ...
Looks good so far. I think we are ready to go.
3 Codebook
This is a table listing all 10 variables that are included in your tibble, and providing some important information about them, mostly drawn from the County Health Rankings web site. For each of your five selected variables, be sure to include the original code (vXXX) from the raw file.
Variable | Description |
---|---|
fipscode | FIPS code |
state | State: my six states are CT, DC, DE, HI, NH, RI |
county | County Name |
life_expectancy | (v147) Life Expectancy, which will be my outcome |
freq_mental_distress | (v145) Frequent Mental Distress Rate |
hsgraduation | (v021) High School Graduation Rate |
unemployment | (v023) Unemployment Rate |
food_insecurity | (v139) Food Insecurity Rate |
unemp_cat | 2 levels: low = unemployment below the median (3.9%), or high |
hsgrad_cat | 3 levels: bottom = hsgraduation below 88%, middle = 88% up to (but not including) 90.9%, or top = 90.9% or above |
Note that I’ve provided details on the definition of our categorical variables.
More details on two of our original five variables are specified below. These results are rephrased versions of the summaries linked on the County Health Rankings site. You’ll need to provide information of this type as part of the codebook for all five of your selected variables.
- life_expectancy was originally variable v147_rawvalue, and is listed in the Length of Life subcategory under Health Outcomes at County Health Rankings. It describes the average number of years a person residing in the county can expect to live, according to the current mortality experience (age-specific death rates) of the county’s population. It is based on data from the National Center for Health Statistics Mortality Files from 2016-18. This will be my outcome variable.
- hsgraduation was originally variable v021_rawvalue, and is listed in the Education subcategory under Social & Economic Factors at County Health Rankings. It describes the proportion of the county’s ninth grade cohort that graduates with a high school diploma in four years, and is based on EDFacts data from 2016-17. Comparisons across state lines are not recommended because of differences in how states define the data, according to County Health Rankings.
3.1 Proposal Requirement 1
Remember that you will need to do five things in the proposal.
- a sentence or two (perhaps accompanied by a small table of R results) specifying the 4-6 states you chose, and the number of counties you are studying in total and within each state. In an additional sentence or two, provide some motivation for why you chose those states.
3.2 Proposal Requirement 2
- A list of the five variables (including their original raw names and your renamed versions) you are studying, with a clear indication of the cutpoints you chose to create the binary categories out of variable 4 and the multiple categories out of variable 5. Think of this as an early version of what will eventually become your codebook. For each variable, provide a sentence describing your motivation for why this variable was interesting to you, and also please specify which of your quantitative variables will serve as your outcome.
3.3 Proposal Requirement 3
Print the tibble, so we can verify that it is, in fact, a tibble; by default, printing shows just the first 10 rows.
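Printing the tibble is as simple as typing its name in a code chunk:

chr_2019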
# A tibble: 31 x 10
fipscode state county life_expectancy freq_mental_dis~ hsgraduation
<chr> <fct> <chr> <dbl> <dbl> <dbl>
1 09001 CT Fairf~ 82.6 0.0966 0.894
2 09003 CT Hartf~ 80.3 0.102 0.855
3 09005 CT Litch~ 80.7 0.0985 0.904
4 09007 CT Middl~ 81.1 0.0961 0.946
5 09009 CT New H~ 80.0 0.111 0.834
6 09011 CT New L~ 79.9 0.109 0.890
7 09013 CT Tolla~ 81.8 0.100 0.923
8 09015 CT Windh~ 78.8 0.114 0.848
9 10001 DE Kent ~ 77.8 0.117 0.901
10 10003 DE New C~ 78.7 0.110 0.877
# ... with 21 more rows, and 4 more variables: unemployment <dbl>,
# food_insecurity <dbl>, unemp_cat <fct>, hsgrad_cat <fct>
3.4 Proposal Requirement 4
To meet proposal requirement 4, run describe from the Hmisc package.
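That call might look like this:

Hmisc::describe(chr_2019)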
chr_2019
10 Variables 31 Observations
--------------------------------------------------------------------------------
fipscode
n missing distinct
31 0 31
lowest : 09001 09003 09005 09007 09009, highest: 44001 44003 44005 44007 44009
--------------------------------------------------------------------------------
state
n missing distinct
31 0 6
lowest : CT DC DE HI NH, highest: DC DE HI NH RI
Value CT DC DE HI NH RI
Frequency 8 1 3 4 10 5
Proportion 0.258 0.032 0.097 0.129 0.323 0.161
--------------------------------------------------------------------------------
county
n missing distinct
31 0 30
lowest : Belknap County Bristol County Carroll County Cheshire County Coos County
highest: Sullivan County Sussex County Tolland County Washington County Windham County
--------------------------------------------------------------------------------
life_expectancy
n missing distinct Info Mean Gmd .05 .10
31 0 31 1 79.99 1.719 78.04 78.34
.25 .50 .75 .90 .95
78.98 79.92 81.02 81.99 82.56
lowest : 76.76078 77.80577 78.27487 78.34321 78.41546
highest: 81.80405 81.99087 82.53515 82.57683 82.66931
--------------------------------------------------------------------------------
freq_mental_distress
n missing distinct Info Mean Gmd .05 .10
31 0 31 1 0.1102 0.01115 0.09484 0.09663
.25 .50 .75 .90 .95
0.10358 0.11089 0.11691 0.12052 0.12094
lowest : 0.08533063 0.09358767 0.09609184 0.09663473 0.09852291
highest: 0.11844904 0.12052429 0.12054963 0.12133815 0.13272832
--------------------------------------------------------------------------------
hsgraduation
n missing distinct Info Mean Gmd .05 .10
31 0 31 1 0.8799 0.05573 0.7950 0.8235
.25 .50 .75 .90 .95
0.8517 0.8854 0.9214 0.9319 0.9413
lowest : 0.7236731 0.7892854 0.8007369 0.8235494 0.8338990
highest: 0.9302662 0.9318966 0.9370460 0.9455388 0.9460923
--------------------------------------------------------------------------------
unemployment
n missing distinct Info Mean Gmd .05 .10
31 0 31 1 0.03647 0.01256 0.02219 0.02353
.25 .50 .75 .90 .95
0.02608 0.03922 0.04503 0.04974 0.05024
lowest : 0.02181668 0.02198898 0.02239195 0.02353413 0.02358351
highest: 0.04842678 0.04973634 0.05018781 0.05028742 0.06051724
--------------------------------------------------------------------------------
food_insecurity
n missing distinct Info Mean Gmd .05 .10
31 0 21 0.994 0.1058 0.01585 0.0895 0.0900
.25 .50 .75 .90 .95
0.0975 0.1040 0.1150 0.1240 0.1290
lowest : 0.074 0.089 0.090 0.093 0.095, highest: 0.119 0.124 0.128 0.130 0.132
--------------------------------------------------------------------------------
unemp_cat
n missing distinct
31 0 2
Value high low
Frequency 16 15
Proportion 0.516 0.484
--------------------------------------------------------------------------------
hsgrad_cat
n missing distinct
31 0 3
Value bottom middle top
Frequency 11 10 10
Proportion 0.355 0.323 0.323
--------------------------------------------------------------------------------
3.5 Three Important Checks
There are three important things I have to demonstrate, as described in Tasks C (Identify Your Variables) and D (Create Categorical Variables) in our Data Development work. They are:
- Each of the five variables you select must have data for at least 75% of the counties in each state you plan to study.
Do we have any missing data here?
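Here is a sketch of one way to check (the exact chunk is my assumption), counting missing values in each of the selected variables:

chr_2019 %>%
    summarize(across(life_expectancy:food_insecurity, ~ sum(is.na(.x))))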
# A tibble: 1 x 5
life_expectancy freq_mental_distress hsgraduation unemployment food_insecurity
<int> <int> <int> <int> <int>
1 0 0 0 0 0
Nope, so we’re OK!
If I did have some missingness, then I would probably want to summarize this by state, so that I could compare the results. Here’s a way to look at this just for the life_expectancy variable.
mosaic::favstats(life_expectancy ~ state, data = chr_2019) %>%
select(state, n, missing) %>%
mutate(pct_available = 100*(n - missing)/n) %>%
kable()
state | n | missing | pct_available |
---|---|---|---|
CT | 8 | 0 | 100 |
DC | 1 | 0 | 100 |
DE | 3 | 0 | 100 |
HI | 4 | 0 | 100 |
NH | 10 | 0 | 100 |
RI | 5 | 0 | 100 |
We’re OK, because 100% of the data are available. In my example, this is true for all five of the variables I used. In yours, that may or may not be the case. Remember that all of your selected variables need to be available in at least 75% of the counties in EACH state you study.
- The raw versions of each of your five selected variables must have at least 10 distinct non-missing values.
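Counting distinct non-missing values might be done like this (again, a sketch):

chr_2019 %>%
    summarize(across(life_expectancy:food_insecurity, n_distinct))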
# A tibble: 1 x 5
life_expectancy freq_mental_distress hsgraduation unemployment food_insecurity
<int> <int> <int> <int> <int>
1 31 31 31 31 21
OK. We’re fine there.
- For each of the categorical variables you create, every level of the resulting factor must include at least 10 counties.
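The counts that follow can be obtained with tabyl() from janitor (my assumption about the tool used):

chr_2019 %>% tabyl(unemp_cat)
chr_2019 %>% tabyl(hsgrad_cat)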
unemp_cat n percent
high 16 0.516129
low 15 0.483871
hsgrad_cat n percent
bottom 11 0.3548387
middle 10 0.3225806
top 10 0.3225806
OK. I have at least 10 counties in each category for each of the categorical variables that I created.
3.6 Saving the Tibble
Finally, we’ll save this tibble as an R data set into the same location as our original data set within our R Project directory.
You’ll want to substitute in your own name, of course.
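Here is a sketch of such a save, using write_rds() from readr; the data subfolder and file name are placeholders.

# substitute your own name (and your folder structure) in the file path
write_rds(chr_2019, "data/chr_2019_Your_Name.Rds")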
3.7 Proposal Requirement 5
Having done all of this work, the set of Proposal Requirements should be straightforward. We’ve already dealt with the first four; the fifth is repeated below.
- In a paragraph, describe the most challenging (or difficult) part of completing the work so far, and how you were able to overcome whatever it was that was difficult.
OK. That’s your job.
4 Analyses
This isn’t part of the proposal.
5 Session Information
R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] ggridges_0.5.2 mosaicData_0.20.1 ggformula_0.9.4 ggstance_0.3.4
[5] Matrix_1.2-18 lattice_0.20-41 forcats_0.5.0 stringr_1.4.0
[9] dplyr_1.0.2 purrr_0.3.4 readr_1.3.1 tidyr_1.1.2
[13] tibble_3.0.3 ggplot2_3.3.2 tidyverse_1.3.0 magrittr_1.5
[17] janitor_2.0.1 rmdformats_0.3.7 knitr_1.29
loaded via a namespace (and not attached):
[1] fs_1.5.0 lubridate_1.7.9 RColorBrewer_1.1-2
[4] httr_1.4.2 tools_4.0.2 backports_1.1.10
[7] utf8_1.1.4 R6_2.4.1 rpart_4.1-15
[10] Hmisc_4.4-1 DBI_1.1.0 colorspace_1.4-1
[13] nnet_7.3-14 withr_2.2.0 tidyselect_1.1.0
[16] gridExtra_2.3 leaflet_2.0.3 curl_4.3
[19] compiler_4.0.2 cli_2.0.2 rvest_0.3.6
[22] htmlTable_2.1.0 xml2_1.3.2 ggdendro_0.1.22
[25] bookdown_0.20 checkmate_2.0.0 mosaicCore_0.8.0
[28] scales_1.1.1 digest_0.6.25 foreign_0.8-80
[31] rmarkdown_2.3.3 jpeg_0.1-8.1 base64enc_0.1-3
[34] pkgconfig_2.0.3 htmltools_0.5.0 dbplyr_1.4.4
[37] highr_0.8 htmlwidgets_1.5.1 rlang_0.4.7
[40] readxl_1.3.1 rstudioapi_0.11 farver_2.0.3
[43] generics_0.0.2 jsonlite_1.7.1 crosstalk_1.1.0.1
[46] Formula_1.2-3 Rcpp_1.0.5 munsell_0.5.0
[49] fansi_0.4.1 lifecycle_0.2.0 stringi_1.5.3
[52] yaml_2.2.1 snakecase_0.11.0 MASS_7.3-53
[55] plyr_1.8.6 grid_4.0.2 blob_1.2.1
[58] ggrepel_0.8.2 crayon_1.3.4 haven_2.3.1
[61] splines_4.0.2 hms_0.5.3 pillar_1.4.6
[64] reprex_0.3.0 glue_1.4.2 evaluate_0.14
[67] latticeExtra_0.6-29 data.table_1.13.0 modelr_0.1.8
[70] png_0.1-7 vctrs_0.3.4 tweenr_1.0.1
[73] cellranger_1.1.0 gtable_0.3.0 polyclip_1.10-0
[ reached getOption("max.print") -- omitted 8 entries ]