Sample Plan for 431 Project A

Using data from CHR 2018 (and 2013)

Author

Thomas E. Love, Ph.D.

Modified

2024-09-23

Some important notes
  1. An HTML version of this document is available to view at https://rpubs.com/TELOVE/ProjectA-sample-plan-431-2024.

    • Click on the </> Code link at the top right of the document (next to the Table of Contents) to view (and download) the Quarto source code.
  2. A template for the Project A Plan is available to you, via the Examples page on the Project A website. Please use it in combination with this document to prepare revisions, as needed, to your Plan. We used it to develop this document.

  3. You need to write your own comments and responses to the Plan’s requirements. You are welcome to use the words here as illustrative examples of what we’re looking for, but these should be edited by you to be specific to your project.

  4. You need a real title (80 characters, maximum, without using “431” or “Project” or “Project A”) in your Plan. You can, as we have above, include a subtitle, but the main title must stand on its own. Of course, in this sample plan, we used some words you’re not allowed to use, and we will break other rules (and note them) in what follows.

1 R Packages

Code
knitr::opts_chunk$set(comment = NA)

library(janitor)
library(knitr)
library(naniar)
library(xfun)
library(easystats)
library(tidyverse)

theme_set(theme_bw())

url_script <- 
  "https://raw.githubusercontent.com/THOMASELOVE/431-data/refs/heads/main/data/Love-431.R"

source(url_script)

2 Data Ingest

Our ingest is different from yours.

These are a few of the things we did differently to get our data.

  1. We are pulling data from 2018 here (and 2013 later) through its URL at the County Health Rankings data and documentation site. You are working with 2024 and 2019 data.

  2. In 2013 and 2018, what is now called county_clustered was called county_ranked, and we need to account for that here.

Code
data_2018_url <- 
  "https://www.countyhealthrankings.org/sites/default/files/analytic_data2018_0.csv"

chr_2018_raw <- read_csv(data_2018_url, skip = 1, guess_max = 4000,
                         show_col_types = FALSE) |>
  rename(county_clustered = county_ranked) |>
  select(fipscode, county, state, county_clustered, year,
         ends_with("rawvalue")) 

Next, we filter these data to the rows which have county_clustered values of 1.

Code
chr_2018_raw <- chr_2018_raw |>
  filter(county_clustered == 1)

The resulting chr_2018_raw tibble now has 3078 rows, and 107 columns.

Inline coding!

Make sure you look at the Quarto file for this document, and note the use of inline coding to get R to tell me the number of rows and number of columns in the resulting chr_2018_raw tibble.

Another approach would have been to use the dim() function here.
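
As a minimal sketch of both approaches, here is a small made-up data frame (toy, a hypothetical stand-in for chr_2018_raw) with its row and column counts obtained in two ways. The inline expression shown in the comment is an assumption about what the Quarto source looks like, not a copy of it.

```r
# Hypothetical toy data standing in for chr_2018_raw
toy <- data.frame(fipscode = c("01001", "01003", "01005"),
                  v001_rawvalue = c(0.1, 0.2, 0.3))

n_rows <- nrow(toy)   # number of rows
n_cols <- ncol(toy)   # number of columns
dims   <- dim(toy)    # rows and columns together, in one call

# In the Quarto text, knitr-style inline code might look like this:
#   The tibble has `r nrow(toy)` rows and `r ncol(toy)` columns.
cat("Rows:", n_rows, " Columns:", n_cols, "\n")
```

The inline version keeps the sentence in sync with the data if the ingest changes, while dim() is handy for a quick check in a code chunk.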

3 State Selection

We’re using some states you cannot.

In selecting six states for this sample plan, we’re using some states you’re not permitted to use. Specifically, we have arbitrarily decided to use New York, Ohio, Massachusetts, Pennsylvania, Maine and North Carolina.

Here, we’ll select our six states, then change the state to a factor variable.

Code
chr_2018 <- chr_2018_raw |>
  filter(state %in% c("NY", "OH", "MA", "PA", "ME", "NC")) |>
  mutate(state = factor(state))

Next, we’ll look to see how many counties are in each state.

Code
chr_2018 |> count(state) 
# A tibble: 6 × 2
  state     n
  <fct> <int>
1 MA       14
2 ME       16
3 NC      100
4 NY       62
5 OH       88
6 PA       67

We have selected 6 states, yielding a total of 347 clustered counties, which is between 300 and 800 so we’re all set.

Inline coding, again!

Again, in this last sentence, we’ve used inline coding to get R to tell me the number of states and the number of rows in the resulting chr_2018 tibble.
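
The same counts can also be checked in code rather than only reported in a sentence. The sketch below uses a tiny made-up tibble (toy, hypothetical) and the 300 to 800 range mentioned above; the names n_states, n_counties and in_range are ours, for illustration only.

```r
library(dplyr)

# Hypothetical toy data standing in for chr_2018
toy <- tibble(state = factor(c("NY", "NY", "OH", "MA")),
              county = c("A", "B", "C", "D"))

n_states   <- n_distinct(toy$state)  # number of states selected
n_counties <- nrow(toy)              # number of counties (rows)

# The plan calls for between 300 and 800 counties;
# this four-row toy example would (correctly) fail that check
in_range <- between(n_counties, 300, 800)
in_range
```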

Here is the place to put a brief description as to why you selected the states that you selected. We will leave that work to you. As for our reason, these are six states in which Dr. Love has spent pleasant summer vacations.

4 Variable Selection

We chose variables you couldn’t choose.

We have selected a set of five variables for this sample plan. None of these variables were available for you to choose.

  • The variables we selected for Analysis 1 turn out to have missing values. Yours may, or may not, in practice.
  • The variables we selected for Analyses 2 and 3 do not have missingness in their raw values, as it turns out.

We’ve decided to select variables v128, v065, v024, v052 and v122.

Code
chr_2018 <- chr_2018 |>
  select(fipscode, state, county, county_clustered,
         v128_rawvalue, v065_rawvalue, v024_rawvalue, 
         v052_rawvalue, v122_rawvalue)

We now have a chr_2018 tibble with exactly 9 columns, as required.

5 Variable Cleaning and Renaming

The variables we are using describe the following measures:

Source for the detailed descriptions below
  • Use this link for the current version of this information.
Initial Name | New Name | Role | Description | Gathered
--------------|----------|------|-------------|---------
v128_rawvalue | child_mort | A1 outcome | Child mortality (deaths among residents under age 18 per 100,000 population) | 2013-16
v065_rawvalue | free_lunch | A1 predictor | % of children enrolled in public schools that are eligible for free or reduced price lunch | 2015-16
v024_rawvalue | child_pov | A2 outcome | % of people under 18 in poverty | 2016
v052_rawvalue | below_18 | A2 predictor | % of county residents below 18 years of age | 2016
v122_rawvalue | unins_kids_2018 | A3 outcome | % of children under age 19 without health insurance | 2015

How do we need to clean our variables?
  • v065, v024, v052 and v122 are all proportions, that need to be multiplied by 100
  • v128 is OK as is

Here, we’ll multiply the four variables that describe proportions by 100 to obtain percentages instead, to ease interpretation.

Code
chr_2018 <- chr_2018 |>
  mutate(free_lunch = 100*v065_rawvalue,
         child_pov = 100*v024_rawvalue,
         below_18 = 100*v052_rawvalue,
         unins_kids_2018 = 100*v122_rawvalue,
         .keep = "unused") |>
  rename(child_mort = v128_rawvalue)
Let’s check which variables we have now…
Code
dim(chr_2018)
[1] 347   9
Code
names(chr_2018)
[1] "fipscode"         "state"            "county"           "county_clustered"
[5] "child_mort"       "free_lunch"       "child_pov"        "below_18"        
[9] "unins_kids_2018" 

What does this indicate to you about the use of .keep = "unused" in the mutate() function?

  • The .keep = "unused" in mutate() retains only the columns not used in the process of creating new columns. This is useful if, as in this case, you want to generate new columns, but no longer need the columns used to generate them. See this reference on mutate().
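
To see the effect directly, here is a minimal sketch comparing the default behavior of mutate() with .keep = "unused". The tibble toy and its column names are hypothetical, made up for this illustration.

```r
library(dplyr)

# Hypothetical toy data: one proportion to rescale, one column left alone
toy <- tibble(id = 1:3,
              prop_raw = c(0.10, 0.25, 0.50),
              other = c("a", "b", "c"))

# Default (.keep = "all"): prop_raw stays alongside the new column
kept_all <- toy |> mutate(pct = 100 * prop_raw)

# .keep = "unused": prop_raw is dropped because it was used to build pct
kept_unused <- toy |> mutate(pct = 100 * prop_raw, .keep = "unused")

names(kept_all)
names(kept_unused)
```

Columns that play no part in the calculation (here, id and other) are kept in both cases.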

We renamed v122 as unins_kids_2018 since it was reported in CHR 2018. Soon, we will create unins_kids_2013 as well, for comparison in Analysis 3.

6 Creating the Analysis 2 Predictor

To establish our cut points, we should look at the 40th and 60th percentiles of the existing data for our planned predictor for Analysis 2, which is below_18.

Code
chr_2018 |>
  summarise(q40 = quantile(below_18, c(0.4)),
            q60 = quantile(below_18, c(0.6)))
# A tibble: 1 × 2
    q40   q60
  <dbl> <dbl>
1  20.4  21.7

So we will create a variable with three possible values: counties with values of 20.4 and lower will fall in the “Low” group, counties with values of 21.7 and higher will fall in the “High” group, and counties in between will be left as missing (see footnote 1).

Code
chr_2018 <- chr_2018 |>
  mutate(below18_grp = case_when(
    below_18 <= 20.4 ~ "Low",
    below_18 >= 21.7 ~ "High")) |>
  mutate(below18_grp = factor(below18_grp))

chr_2018 |> count(below18_grp)
# A tibble: 3 × 2
  below18_grp     n
  <fct>       <int>
1 High          139
2 Low           139
3 <NA>           69

It appears that we have 139 subjects (40% of the original 347) in the High group and the same number in the Low group, with the rest now listed as missing, and the below18_grp variable is now a factor, so that’s fine. (If you have a slightly different number in “High” than in “Low”, that would also be OK, so long as it’s close to 40% in each group.)
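
An alternative to typing the cut points by hand is to compute them once and reuse them. The sketch below does this on made-up data (toy, hypothetical), and is not what we did above: we used the rounded values 20.4 and 21.7, and unrounded quantiles could place a few borderline counties differently.

```r
library(dplyr)

# Hypothetical toy data standing in for the below_18 column
toy <- tibble(below_18 = 1:10)

# Compute the 40th and 60th percentiles once (names = FALSE drops the "40%" label)
q40 <- quantile(toy$below_18, 0.4, names = FALSE)
q60 <- quantile(toy$below_18, 0.6, names = FALSE)

toy <- toy |>
  mutate(grp = case_when(
           below_18 <= q40 ~ "Low",
           below_18 >= q60 ~ "High"),  # values strictly between the cuts become NA
         grp = factor(grp))

toy |> count(grp)
```

Rows that match neither condition receive NA from case_when(), which is how the middle group ends up listed as missing.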

7 Adding 2019 (in our case 2013) Data for the Analysis 3 Outcome

Our approach here is a bit different from yours.

In our case, we’ll add data from CHR 2013, since that’s five years prior to the 2018 County Health Rankings report.

Rather than pull this from a .csv file, we will pull it directly from the CHR website, as follows.

The variables we need in our chr_2013_raw file are just the fipscode and our analysis 3 outcome, which starts as the v122_rawvalue variable.

  • The County Health Rankings data documentation for CHR 2013 (pdf) tells me that the data on this variable (v122_rawvalue) come from Small Area Health Insurance Estimates for 2010. We’ll need that year when we build the codebook, later.
Code
data_2013_url <- 
  "https://www.countyhealthrankings.org/sites/default/files/analytic_data2013.csv"

chr_2013 <- read_csv(data_2013_url, skip = 1, guess_max = 4000,
                     show_col_types = FALSE) |>
  rename(county_clustered = county_ranked) |>
  filter(county_clustered == 1) |>
  select(fipscode, v122_rawvalue) |>
  mutate(unins_kids_2013 = 100*v122_rawvalue,
         .keep = "unused")

names(chr_2013)
[1] "fipscode"        "unins_kids_2013"

Now, we’ll join the two files.

Code
chr_2018 <- left_join(chr_2018, chr_2013, by = "fipscode")

Finally, we’ll check to see if this has created any missing values (which would happen if a county in CHR 2018 had data on this variable but did not in CHR 2013.)

Code
n_miss(chr_2018$unins_kids_2013)
[1] 0
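
If that count had not been zero, it would help to know which counties failed to match. Here is a minimal sketch of one way to find them, using made-up tibbles (later and earlier, hypothetical stand-ins for chr_2018 and chr_2013).

```r
library(dplyr)

# Hypothetical toy data: 'later' plays the role of chr_2018, 'earlier' of chr_2013
later   <- tibble(fipscode = c("01001", "01003", "01005"), y_later = c(4, 5, 6))
earlier <- tibble(fipscode = c("01001", "01003"), y_earlier = c(7, 8))

# left_join() keeps every row of 'later'; unmatched rows get NA in y_earlier
joined <- left_join(later, earlier, by = "fipscode")

# anti_join() lists the rows of 'later' with no match in 'earlier'
unmatched <- anti_join(later, earlier, by = "fipscode")

sum(is.na(joined$y_earlier))
unmatched$fipscode
```

Note that anti_join() only catches counties absent from the earlier file. A county that is present but has a missing value there would still need the n_miss() check.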

8 Arranging and Saving the Analytic Tibble

Now we arrange the variables in the specified order from Data Task 5, and then save the new result to a new .Rds file.

Code
chr_2018 <- chr_2018 |>
  select(fipscode, state, county, 
         child_mort, free_lunch, ## Analysis 1 variables
         child_pov, below18_grp, ## Analysis 2 variables
         below_18, ## Quantitative version of group
         unins_kids_2018, unins_kids_2013, ## Analysis 3 variables
         county_clustered)

write_rds(chr_2018, file = "chr_2018_Thomas_Love.Rds")
Important
  • We will make no changes to the chr_2018 tibble after this point in the Plan.
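
One optional check, not required by the Plan, is to read the saved file back in and confirm that nothing changed. This sketch uses a made-up data frame (toy, hypothetical) and a temporary file so that no project file is overwritten.

```r
library(readr)

# Hypothetical toy data, including a factor like our state variable
toy <- data.frame(fipscode = c("01001", "01003"),
                  state = factor(c("AL", "AL")))

# A temporary path, so nothing in the project folder is touched
path <- tempfile(fileext = ".Rds")

write_rds(toy, file = path)
toy_back <- read_rds(path)

# Column types (including the factor) should survive the round trip
all.equal(toy, toy_back)
```

Unlike a .csv file, an .Rds file stores R's column types, so the factor does not need to be rebuilt after reading.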

10 Numerical Summaries

10.1 Table of States by Binary Factor

Code
chr_2018 |> tabyl(state, below18_grp) |> 
  adorn_totals(where = c("row", "col"))
 state High Low NA_ Total
    MA    2   7   5    14
    ME    1  15   0    16
    NC   42  35  23   100
    NY   15  30  17    62
    OH   63   9  16    88
    PA   16  43   8    67
 Total  139 139  69   347
  • As expected, there are some missing values in each column. We have some very small sample sizes in Massachusetts and Maine, but that’s part of the reason why we didn’t let you use those states in your work.

10.2 describe_distribution() results

Code
describe_distribution(chr_2018)
Variable         |  Mean |    SD |   IQR |           Range | Skewness | Kurtosis |   n | n_Missing
--------------------------------------------------------------------------------------------------
child_mort       | 53.66 | 16.59 | 18.09 | [19.96, 125.42] |     1.10 |     2.28 | 309 |        38
free_lunch       | 52.29 | 16.61 | 16.51 |  [14.93, 99.10] |     0.94 |     1.19 | 333 |        14
child_pov        | 21.14 |  7.37 |  9.20 |   [4.70, 57.10] |     0.64 |     1.31 | 347 |         0
below_18         | 20.92 |  2.70 |  3.38 |   [5.15, 32.29] |    -0.54 |     3.96 | 347 |         0
unins_kids_2018  |  4.70 |  1.73 |  1.96 |   [0.83, 17.96] |     1.44 |     9.49 | 347 |         0
unins_kids_2013  |  6.80 |  2.12 |  2.60 |   [1.50, 16.80] |     0.51 |     2.08 | 347 |         0
county_clustered |  1.00 |  0.00 |  0.00 |    [1.00, 1.00] |          |          | 347 |         0

Here, we have minimum and maximum values that make sense for all of the quantitative variables in our data. The counts (n) match what we expect for each variable: as mentioned previously, only the two Analysis 1 variables we selected (child_mort and free_lunch) have missing values.

10.3 data_codebook() results

Code
data_codebook(chr_2018, max_values = 6, range_at = 15)
chr_2018 (347 rows and 11 variables, 11 shown)

ID | Name             | Type        |   Missings |           Values |            N
---+------------------+-------------+------------+------------------+-------------
1  | fipscode         | character   |   0 (0.0%) |            23001 |   1 (  0.3%)
   |                  |             |            |            23003 |   1 (  0.3%)
   |                  |             |            |            23005 |   1 (  0.3%)
   |                  |             |            |            23007 |   1 (  0.3%)
   |                  |             |            |            23009 |   1 (  0.3%)
   |                  |             |            |            23011 |   1 (  0.3%)
   |                  |             |            |            (...) |             
---+------------------+-------------+------------+------------------+-------------
2  | state            | categorical |   0 (0.0%) |               MA |  14 (  4.0%)
   |                  |             |            |               ME |  16 (  4.6%)
   |                  |             |            |               NC | 100 ( 28.8%)
   |                  |             |            |               NY |  62 ( 17.9%)
   |                  |             |            |               OH |  88 ( 25.4%)
   |                  |             |            |               PA |  67 ( 19.3%)
---+------------------+-------------+------------+------------------+-------------
3  | county           | character   |   0 (0.0%) |     Adams County |   2 (  0.6%)
   |                  |             |            |  Alamance County |   1 (  0.3%)
   |                  |             |            |    Albany County |   1 (  0.3%)
   |                  |             |            | Alexander County |   1 (  0.3%)
   |                  |             |            |  Allegany County |   1 (  0.3%)
   |                  |             |            | Alleghany County |   1 (  0.3%)
   |                  |             |            |            (...) |             
---+------------------+-------------+------------+------------------+-------------
4  | child_mort       | numeric     | 38 (11.0%) |  [19.96, 125.42] |          309
---+------------------+-------------+------------+------------------+-------------
5  | free_lunch       | numeric     |  14 (4.0%) |    [14.93, 99.1] |          333
---+------------------+-------------+------------+------------------+-------------
6  | child_pov        | numeric     |   0 (0.0%) |      [4.7, 57.1] |          347
---+------------------+-------------+------------+------------------+-------------
7  | below18_grp      | categorical | 69 (19.9%) |             High | 139 ( 50.0%)
   |                  |             |            |              Low | 139 ( 50.0%)
---+------------------+-------------+------------+------------------+-------------
8  | below_18         | numeric     |   0 (0.0%) |    [5.15, 32.29] |          347
---+------------------+-------------+------------+------------------+-------------
9  | unins_kids_2018  | numeric     |   0 (0.0%) |    [0.83, 17.96] |          347
---+------------------+-------------+------------+------------------+-------------
10 | unins_kids_2013  | numeric     |   0 (0.0%) |      [1.5, 16.8] |          347
---+------------------+-------------+------------+------------------+-------------
11 | county_clustered | numeric     |   0 (0.0%) |                1 | 347 (100.0%)
----------------------------------------------------------------------------------
  • All of our planned outcome and quantitative predictor values show reasonable minimum and maximum values.
  • We have less than 20% missing values in each of our Analysis 1 variables, and no missingness in our Analysis 2 or Analysis 3 outcomes. As expected, we have 20% missing data in our binary factor for Analysis 2.

So, we pass all of the necessary checks.
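
The missingness percentages can also be computed directly, rather than read off the codebook. This is a minimal sketch on made-up data (toy, hypothetical), with the 20% threshold taken from the wording of the check above.

```r
library(dplyr)

# Hypothetical toy data: one of ten outcome values is missing
toy <- tibble(outcome = c(1.2, NA, 3.4, 5.6, 7.8, 9.0, 2.1, 4.3, 6.5, 8.7),
              predictor = 1:10)

# Percentage of missing values in each column
pct_missing <- toy |>
  summarise(across(everything(), ~ 100 * mean(is.na(.x))))

# TRUE when every column is below the 20% threshold
all_ok <- all(unlist(pct_missing) < 20)

pct_missing
all_ok
```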

10.4 Distinct Values

A problem with the initial instructions

What I originally told you to do was this:

Code
chr_2018 |> 
  summarise(across(everything(), ~ n_distinct(.)))
# A tibble: 1 × 11
  fipscode state county child_mort free_lunch child_pov below18_grp below_18
     <int> <int>  <int>      <int>      <int>     <int>       <int>    <int>
1      347     6    282        310        334       194           3      347
# ℹ 3 more variables: unins_kids_2018 <int>, unins_kids_2013 <int>,
#   county_clustered <int>

but the problem here is that some of the results we want to see don’t turn up in the printed output.

You could show, for instance, the counts for the last five variables in another call to this function, as follows:

Code
chr_2018 |> 
  summarise(across(everything(), ~ n_distinct(.))) |>
  select(7:11)
# A tibble: 1 × 5
  below18_grp below_18 unins_kids_2018 unins_kids_2013 county_clustered
        <int>    <int>           <int>           <int>            <int>
1           3      347             346              84                1

Another simple and attractive enough way to show the results of this check for all 11 variables is to use the kable() function from the knitr package, as we have done below. There we can see all eleven results if we scroll through the HTML.

Code
chr_2018 |> 
  summarise(across(everything(), ~ n_distinct(.))) |>
  kable()
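fipscode | state | county | child_mort | free_lunch | child_pov | below18_grp | below_18 | unins_kids_2018 | unins_kids_2013 | county_clustered
---------|-------|--------|------------|------------|-----------|-------------|----------|-----------------|-----------------|-----------------
     347 |     6 |    282 |        310 |        334 |       194 |           3 |      347 |             346 |              84 |                1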
  • We have a distinct fipscode for each of our 347 counties.
  • We have at least 15 distinct values in our outcomes (child_mort, child_pov and unins_kids in each year) and in our quantitative predictor (free_lunch) for Analysis 1.
  • We have 6 states and we have the same value (1) in county_clustered for every row of our data, so that’s correct, too.

So we pass all of the necessary checks here, as well.

Avoid scrolling here?

Here’s a way to avoid the scrolling window in HTML…

Code
tab10_4 <- chr_2018 |> 
  summarise(across(everything(), ~ n_distinct(.)))

tab10_4 |> select(1:5) |> kable()
fipscode | state | county | child_mort | free_lunch
---------|-------|--------|------------|-----------
     347 |     6 |    282 |        310 |        334
Code
tab10_4 |> select(6:11) |> kable()
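child_pov | below18_grp | below_18 | unins_kids_2018 | unins_kids_2013 | county_clustered
----------|-------------|----------|-----------------|-----------------|-----------------
      194 |           3 |      347 |             346 |              84 |                1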

There are other, fancier, approaches we could use, but we will be happy with any of these, so long as we can see the results for all 11 columns.
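
As one example of such an approach, the wide one-row summary can be turned into a long table with one row per variable, which avoids both truncation and scrolling. This is a sketch on made-up data (toy, hypothetical); the column names variable and n_distinct in the result are our choices.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data, with one missing value in grp
toy <- tibble(id = c("a", "b", "c", "d"),
              grp = c("High", "Low", NA, "High"),
              value = c(1.5, 2.5, 2.5, 4.0))

# One row per variable instead of one wide row
distinct_long <- toy |>
  summarise(across(everything(), n_distinct)) |>
  pivot_longer(everything(),
               names_to = "variable", values_to = "n_distinct")

distinct_long
```

By default, n_distinct() counts NA as a distinct value. That is why a two-level factor with missing values, like below18_grp, shows 3 in the tables above.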

11 The Codebook

Our chr_2018 tibble contains 347 counties and 11 variables.

Variable | Role | Old Name | Description | Year(s)
---------|------|----------|-------------|--------
fipscode | ID | fipscode | FIPS code |
state | ID | state | State Abbreviation (OH, MA, ME, NC, NY, PA) |
county | ID | county | County Name |
child_mort | A1 outcome | v128 | Child mortality (deaths among residents under age 18 per 100,000 population) | 2013-16
free_lunch | A1 predictor | v065 | % of children enrolled in public schools that are eligible for free or reduced price lunch | 2015-16
child_pov | A2 outcome | v024 | % of people under 18 in poverty | 2016
below18_grp | A2 predictor | - | Low (below_18 ≤ 20.4) or High (below_18 ≥ 21.7) % of county residents below 18 years of age | 2016
below_18 |  | v052 | % of county residents below 18 years of age | 2016
unins_kids_2018 | A3 outcome | v122 | % of children under age 19 without health insurance, CHR 2018 | 2015
unins_kids_2013 | A3 outcome | v122 | % of children under age 19 without health insurance, CHR 2013 | 2010
county_clustered | - | county_clustered | Indicates county is ranked (all values are 1, as required) | 2024

12 Research Questions

12.1 Analysis 1 Research Question

Here is where you’ll place your research question for Analysis 1, which in our case involves predicting child_mort from free_lunch.

12.2 Analysis 2 Research Question

Here is where you’ll place your research question for Analysis 2, which in our case involves comparing means of child_pov across our two groups in below18_grp.

12.3 Analysis 3 Research Question

Here is where you’ll place your research question for Analysis 3, which in our case involves comparing means of unins_kids in the 2018 report (where the data were measured in 2015) as compared to the 2013 report by CHR (where the data come from 2010).

13 Reflection

Here is where you’ll place your reflection. We’ll leave that to you.

14 Session Information

Code
xfun::session_info()
R version 4.4.1 (2024-06-14 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 22631)

Locale:
  LC_COLLATE=English_United States.utf8 
  LC_CTYPE=English_United States.utf8   
  LC_MONETARY=English_United States.utf8
  LC_NUMERIC=C                          
  LC_TIME=English_United States.utf8    

Package version:
  askpass_1.2.0       backports_1.5.0     base64enc_0.1.3    
  bayestestR_0.14.0   bit_4.0.5           bit64_4.0.5        
  blob_1.2.4          broom_1.0.6         bslib_0.8.0        
  cachem_1.1.0        callr_3.7.6         cellranger_1.1.0   
  cli_3.6.3           clipr_0.8.0         coda_0.19-4.1      
  codetools_0.2-20    colorspace_2.1-1    compiler_4.4.1     
  conflicted_1.2.0    correlation_0.8.5   cpp11_0.5.0        
  crayon_1.5.3        curl_5.2.2          data.table_1.16.0  
  datasets_4.4.1      datawizard_0.12.3   DBI_1.2.3          
  dbplyr_2.5.0        digest_0.6.37       dplyr_1.1.4        
  dtplyr_1.3.1        easystats_0.7.3     effectsize_0.8.9   
  emmeans_1.10.4      estimability_1.5.1  evaluate_1.0.0     
  fansi_1.0.6         farver_2.1.2        fastmap_1.2.0      
  fontawesome_0.5.2   forcats_1.0.0       fs_1.6.4           
  gargle_1.5.2        generics_0.1.3      ggplot2_3.5.1      
  glue_1.7.0          googledrive_2.1.1   googlesheets4_1.1.1
  graphics_4.4.1      grDevices_4.4.1     grid_4.4.1         
  gridExtra_2.3       gtable_0.3.5        haven_2.5.4        
  highr_0.11          hms_1.1.3           htmltools_0.5.8.1  
  htmlwidgets_1.6.4   httr_1.4.7          ids_1.0.1          
  insight_0.20.4      isoband_0.2.7       janitor_2.2.0      
  jquerylib_0.1.4     jsonlite_1.8.9      knitr_1.48         
  labeling_0.4.3      lattice_0.22-6      lifecycle_1.0.4    
  lubridate_1.9.3     magrittr_2.0.3      MASS_7.3-61        
  Matrix_1.7-0        memoise_2.0.1       methods_4.4.1      
  mgcv_1.9.1          mime_0.12           modelbased_0.8.8   
  modelr_0.1.11       multcomp_1.4-26     munsell_0.5.1      
  mvtnorm_1.3-1       naniar_1.1.0        nlme_3.1.164       
  norm_1.0.11.1       numDeriv_2016.8.1.1 openssl_2.2.1      
  parallel_4.4.1      parameters_0.22.2   performance_0.12.3 
  pillar_1.9.0        pkgconfig_2.0.3     plyr_1.8.9         
  prettyunits_1.2.0   processx_3.8.4      progress_1.2.3     
  ps_1.8.0            purrr_1.0.2         R6_2.5.1           
  ragg_1.3.2          rappdirs_0.3.3      RColorBrewer_1.1.3 
  Rcpp_1.0.13         readr_2.1.5         readxl_1.4.3       
  rematch_2.0.0       rematch2_2.1.2      report_0.5.9       
  reprex_2.1.1        rlang_1.1.4         rmarkdown_2.28     
  rstudioapi_0.16.0   rvest_1.0.4         sandwich_3.1-1     
  sass_0.4.9          scales_1.3.0        see_0.9.0          
  selectr_0.4.2       snakecase_0.11.1    splines_4.4.1      
  stats_4.4.1         stringi_1.8.4       stringr_1.5.1      
  survival_3.7-0      sys_3.4.2           systemfonts_1.1.0  
  textshaping_0.4.0   TH.data_1.1-2       tibble_3.2.1       
  tidyr_1.3.1         tidyselect_1.2.1    tidyverse_2.0.0    
  timechange_0.3.0    tinytex_0.53        tools_4.4.1        
  tzdb_0.4.0          UpSetR_1.4.0        utf8_1.2.4         
  utils_4.4.1         uuid_1.2.1          vctrs_0.6.5        
  viridis_0.6.5       viridisLite_0.4.2   visdat_0.6.0       
  vroom_1.6.5         withr_3.0.1         xfun_0.47          
  xml2_1.3.6          xtable_1.8-4        yaml_2.3.10        
  zoo_1.8-12         

Footnotes

  1. We could have chosen to use “less than 20.4” and “higher than 21.7” as well, which would potentially have a small impact on our final groups.