431 Sample Project A Portfolio Report

Author

Your Name (or Names) Should Be Here

Published

2023-08-23

Some important notes
  1. A template for the Project A proposal is available to you, via the Examples page on the Project A website. Please use it in combination with this document to prepare your proposal. We used it to develop this document.
  2. My instructions and comments in this sample proposal should not appear in your final submitted Project A proposal. They’re just here to help guide you. You need to write your own comments and responses to the Proposal’s requirements.
  3. You need a real title (80 characters, maximum, without using “431” or “Project” or “Project A”) in your proposal. You can, as I have above, include a subtitle, but the main title must stand on its own.

1 R Packages

library(Hmisc)
library(janitor)
library(naniar)
library(sessioninfo)
library(tidyverse)

knitr::opts_chunk$set(comment = NA)

2 Data Ingest

These are data from 2019.

I am ingesting data from the 2019 County Health Rankings, rather than the data you will use.

data_url <- 
  "https://www.countyhealthrankings.org/sites/default/files/media/document/analytic_data2019.csv"

chr_2019_raw <- read_csv(data_url, skip = 1, guess_max = 4000, 
                         show_col_types = FALSE)

Next, we filter these data to the rows which have county_ranked values of 1.

chr_2019_raw <- chr_2019_raw |>
  filter(county_ranked == 1)

The resulting chr_2019_raw tibble now has 3081 rows, and 534 columns.

Inline coding!

Make sure you look at the Quarto file for this document, and note the use of inline coding to get R to tell me the number of rows and number of columns in the resulting chr_2019_raw tibble.

Another approach would have been to use the dim() function here.
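
As a minimal sketch of that alternative, the block below compares nrow() and ncol() with dim() on a small made-up tibble. The name toy_counties and its values are hypothetical stand-ins, not the CHR data.

```r
library(tibble)

# Hypothetical stand-in for chr_2019_raw (the real tibble is much larger)
toy_counties <- tibble(
  fipscode = c("39035", "39049", "39061"),
  state = c("OH", "OH", "OH"),
  county_ranked = c(1, 1, 1)
)

# nrow() and ncol() each return a single number, which suits inline code
n_rows <- nrow(toy_counties)
n_cols <- ncol(toy_counties)

# dim() returns both values at once, as c(rows, columns)
toy_dim <- dim(toy_counties)
toy_dim
```

In a Quarto file, an inline expression written as `r nrow(toy_counties)` inside a sentence would print the row count in the rendered text.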

3 State Selection

State Choice

In selecting the six states for this sample proposal, I’m using some states you’re not permitted to use. Specifically, I’ve arbitrarily decided to use New York, Ohio, Massachusetts, Pennsylvania, Maine and North Carolina.

Here, I’ll select my six states, then change the state to a factor variable.

chr_2019 <- chr_2019_raw |>
  filter(state %in% c("NY", "OH", "MA", "PA", "ME", "NC")) |>
  mutate(state = factor(state))

Next, I’ll look to see how many counties are in each state.

chr_2019 |> count(state) 
# A tibble: 6 × 2
  state     n
  <fct> <int>
1 MA       14
2 ME       16
3 NC      100
4 NY       62
5 OH       88
6 PA       67

We have selected 6 states, yielding a total of 347 ranked counties, which is between 300 and 800 so we’re all set.

Inline coding, again!

Again, in this last sentence, I’ve used inline coding to get R to tell me the number of states and the number of rows in the resulting chr_2019 tibble.
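
The same check can be sketched without inline code. The counts below are copied from the count(state) output above. Treating the 300 and 800 bounds as inclusive is an assumption on my part.

```r
# County counts per state, copied from the count(state) output above
state_counts <- data.frame(
  state = c("MA", "ME", "NC", "NY", "OH", "PA"),
  n = c(14, 16, 100, 62, 88, 67)
)

# Number of states selected and total number of ranked counties
n_states <- nrow(state_counts)
n_counties <- sum(state_counts$n)

# Assumption: the 300 to 800 requirement includes both endpoints
in_range <- n_counties >= 300 & n_counties <= 800

c(states = n_states, counties = n_counties)
in_range
```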

Here is the place to put a brief description as to why you selected the states that you selected. I will leave that work to you.

4 Variable Selection

Choice of Variables

I will select some variables for this example which are not available to you.

I’ve decided to select variables v128, v065, v024, v052 and v122.

chr_2019 <- chr_2019 |>
  select(fipscode, state, county, county_ranked,
         v128_rawvalue, v065_rawvalue, v024_rawvalue, 
         v052_rawvalue, v122_rawvalue)

I now have a chr_2019 tibble with exactly 9 columns, as required.

5 Variable Cleaning and Renaming

The variables I’m using describe the following measures:

Source for the detailed descriptions below
  • Use this link for the CHR 2023 version of this information.
  • Your version of this material should include the year(s) in which this information was obtained. I’ve left that out here.

| Initial Name | New Name | Role | Description |
|---|---|---|---|
| v128_rawvalue | child_mort | A1 outcome | Child mortality (deaths among residents under age 18 per 100,000 population) |
| v065_rawvalue | free_lunch | A1 predictor | % of children enrolled in public schools that are eligible for free or reduced price lunch |
| v024_rawvalue | child_pov | A2 outcome | % of people under 18 in poverty |
| v052_rawvalue | below_18 | A2 predictor | % of county residents below 18 years of age |
| v122_rawvalue | unins_kids | Extra | % of children under age 19 without health insurance |

How do I need to clean my variables?
  • v065, v024, v052 and v122 are all proportions that need to be multiplied by 100 to become percentages
  • v128 is OK as is

Here, I’ll multiply the four variables that describe proportions by 100 to obtain percentages instead, to ease interpretation.

chr_2019 <- chr_2019 |>
  mutate(free_lunch = 100*v065_rawvalue,
         child_pov = 100*v024_rawvalue,
         below_18 = 100*v052_rawvalue,
         unins_kids = 100*v122_rawvalue, 
         .keep = "unused") |>
  rename(child_mort = v128_rawvalue)

Let’s check which variables we have now…

dim(chr_2019)
[1] 347   9
names(chr_2019)
[1] "fipscode"      "state"         "county"        "county_ranked"
[5] "child_mort"    "free_lunch"    "child_pov"     "below_18"     
[9] "unins_kids"   

What does this indicate to you about the use of .keep = "unused" in the mutate() function?
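
To explore that question yourself, here is a small sketch on a made-up tibble. The name toy and its values are hypothetical. The column names mimic the ones used above.

```r
library(dplyr)
library(tibble)

# Hypothetical two-county example with one proportion and one rate
toy <- tibble(
  county = c("A", "B"),
  v065_rawvalue = c(0.25, 0.50),
  v128_rawvalue = c(40, 55)
)

# .keep = "unused" drops the columns that were used to build new ones,
# and retains every column that mutate() did not touch
toy_pct <- toy |>
  mutate(free_lunch = 100 * v065_rawvalue, .keep = "unused")

names(toy_pct)
```

Here v065_rawvalue is consumed by the calculation and disappears, while county and v128_rawvalue remain alongside the new free_lunch column.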

6 Creating the Analysis 2 Predictor

To establish our cutpoints, we should look at the 40th and 60th percentiles of the existing data for our planned predictor for Analysis 2, which is below_18.

chr_2019 |>
  summarise(q40 = quantile(below_18, c(0.4)),
            q60 = quantile(below_18, c(0.6)))
# A tibble: 1 × 2
    q40   q60
  <dbl> <dbl>
1  20.3  21.5

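So we will create a three-level variable where values of 20.3 and lower fall in the “Low” group, and values of 21.5 and higher fall in the “High” group (see footnote 1). Counties with values between those cutpoints match neither condition, so case_when() leaves them as missing (NA).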

chr_2019 <- chr_2019 |>
  mutate(below18_grp = case_when(
    below_18 <= 20.3 ~ "Low",
    below_18 >= 21.5 ~ "High")) |>
  mutate(below18_grp = factor(below18_grp))

chr_2019 |> count(below18_grp)
# A tibble: 3 × 2
  below18_grp     n
  <fct>       <int>
1 High          140
2 Low           137
3 <NA>           70

It appears that we have a little over 40% of our subjects in the High group and a little under 40% in the Low group, with the rest now listed as missing, and the below18_grp variable is now a factor, so that’s fine.
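
The sketch below shows how case_when() treats values that match no condition, and how the inclusive cutpoints used here differ from the strict alternative mentioned in footnote 1. The toy values are hypothetical and were chosen to sit on and around the two cutpoints.

```r
library(dplyr)
library(tibble)

# Hypothetical values placed on and around the 20.3 and 21.5 cutpoints
toy <- tibble(below_18 = c(18.0, 20.3, 21.0, 21.5, 24.2))

toy <- toy |>
  mutate(
    # Inclusive version, as used in this document
    grp_inclusive = case_when(
      below_18 <= 20.3 ~ "Low",
      below_18 >= 21.5 ~ "High"),
    # Strict version, as described in footnote 1
    grp_strict = case_when(
      below_18 < 20.3 ~ "Low",
      below_18 > 21.5 ~ "High")
  )

toy
```

Rows that satisfy neither condition receive NA, which is why the middle group shows up as missing in the count above. Values sitting exactly on a cutpoint are the only ones that change group between the two versions.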

7 Adding 2014 Data for the Analysis 3 Outcome

In my case, I’ll add data from 2014, since that’s five years prior to the 2019 report.

To do so, I created a file called chr_2014_raw.csv that contains two variables: the FIPS code and the values of v122_rawvalue for each of the 3,048 counties ranked in 2014 (see footnote 2).

chr_2014_raw <- read_csv("chr_2014_raw.csv",
                         guess_max = 4000, 
                         show_col_types = FALSE)

chr_2014 <- chr_2014_raw |> mutate(fipscode = as.character(fipscode))

Now, I’ll join the two files.

chr_2019 <- left_join(chr_2019, chr_2014, by = c("fipscode"))
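
Here is a minimal sketch of why the key is converted to character before joining, and of what left_join() keeps. The two toy tibbles and their values are hypothetical. dplyr will not join a character key to a numeric key, so the types need to agree first.

```r
library(dplyr)
library(tibble)

# Hypothetical 2019 data with a character FIPS code
toy_2019 <- tibble(
  fipscode = c("39035", "39049", "39061"),
  unins_kids = c(3.1, 4.2, 5.0)
)

# Hypothetical 2014 data where the FIPS code was read in as a number
toy_2014 <- tibble(
  fipscode = c(39035, 39049, 42003),
  v122_rawvalue = c(0.05, 0.06, 0.04)
)

# Make the key types agree, then join
toy_2014 <- toy_2014 |> mutate(fipscode = as.character(fipscode))
joined <- left_join(toy_2019, toy_2014, by = "fipscode")

joined
```

left_join() keeps every row of the first tibble. A county with no 2014 match gets NA for v122_rawvalue, and a county that appears only in the 2014 file is dropped.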

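We need to rename the two variables which deal with our Analysis 3 outcome. Note that unins_kids was multiplied by 100 earlier and is a percentage, while v122_rawvalue from the 2014 file is still a proportion, so the two renamed variables are not yet on the same scale.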

chr_2019 <- chr_2019 |>
  rename(unins_kids_2019 = unins_kids, 
         unins_kids_2014 = v122_rawvalue)
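
The describe() output later in this document shows unins_kids_2014 with a mean near 0.065 and unins_kids_2019 with a mean near 4.36, which suggests the 2014 values are still proportions. As a sketch only, and not a step this document actually runs, one way to put both on the percentage scale before comparing them is shown below with hypothetical values.

```r
library(dplyr)
library(tibble)

# Hypothetical values: 2019 is already a percentage, 2014 is a proportion
toy <- tibble(
  fipscode = c("39035", "39049"),
  unins_kids_2019 = c(4.1, 3.6),
  unins_kids_2014 = c(0.062, 0.055)
)

# Rescale 2014 to a percentage, then compute the change between reports
toy_pct <- toy |>
  mutate(unins_kids_2014 = 100 * unins_kids_2014,
         change = unins_kids_2019 - unins_kids_2014)

toy_pct
```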

8 Arranging and Saving the Analytic Tibble

Now we arrange the variables in the specified order from Data Task 5, and then save the new result to a new .Rds file called chr_2019_YOURNAME.Rds.

chr_2019 <- chr_2019 |>
  select(fipscode, state, county, 
         child_mort, free_lunch, ## Analysis 1 variables
         child_pov, below18_grp, ## Analysis 2 variables
         below_18, ## Quantitative version of group
         unins_kids_2019, unins_kids_2014, ## Analysis 3 variables
         county_ranked)

write_rds(chr_2019, file = "chr_2019_YOURNAME.Rds")

We will make no changes to the chr_2019 tibble after this point in the Proposal.
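
As a quick sketch of why an .Rds file is a convenient format here, the block below writes a small hypothetical tibble to a temporary file and reads it back. The name toy and the path toy_path are illustrative only.

```r
library(readr)
library(tibble)

# Hypothetical tibble with a character ID and a factor
toy <- tibble(
  fipscode = c("39035", "39049"),
  below18_grp = factor(c("High", "Low"))
)

# Write to a temporary location so nothing in the project is overwritten
toy_path <- file.path(tempdir(), "toy_counties.Rds")
write_rds(toy, file = toy_path)

# Read it back and confirm the column types survived the round trip
toy_back <- read_rds(toy_path)
is.factor(toy_back$below18_grp)
is.character(toy_back$fipscode)
```

Unlike a .csv file, the .Rds file keeps the factor as a factor and the FIPS code as a character string, so no re-cleaning is needed after loading.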

10 Numerical Summaries

describe(chr_2019)
chr_2019 

 11  Variables      347  Observations
--------------------------------------------------------------------------------
fipscode 
       n  missing distinct 
     347        0      347 

lowest : 23001 23003 23005 23007 23009, highest: 42125 42127 42129 42131 42133
--------------------------------------------------------------------------------
state 
       n  missing distinct 
     347        0        6 
                                              
Value         MA    ME    NC    NY    OH    PA
Frequency     14    16   100    62    88    67
Proportion 0.040 0.046 0.288 0.179 0.254 0.193
--------------------------------------------------------------------------------
county 
       n  missing distinct 
     347        0      282 

lowest : Adams County     Alamance County  Albany County    Alexander County Allegany County 
highest: Wyoming County   Yadkin County    Yancey County    Yates County     York County     
--------------------------------------------------------------------------------
child_mort 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
     307       40      307        1    53.32    17.58    29.67    34.20 
     .25      .50      .75      .90      .95 
   42.36    51.59    61.07    73.04    85.87 

lowest : 21.9066 24.7998 26.5791 26.9921 27.0056
highest: 98.2766 98.5884 103.693 107.095 116.575
--------------------------------------------------------------------------------
free_lunch 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
     333       14      333        1    52.88    17.83    30.68    35.33 
     .25      .50      .75      .90      .95 
   42.32    51.01    59.62    73.08    92.42 

lowest : 14.7299 17.3719 19.2631 20.0969 20.7698
highest: 97.9279 98.2072 98.3639 98.4087 98.7017
--------------------------------------------------------------------------------
child_pov 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
     347        0      195        1    20.36    8.176     9.29    11.56 
     .25      .50      .75      .90      .95 
   15.10    20.10    24.60    29.08    33.00 

lowest : 5.3  5.6  5.9  6.1  6.5 , highest: 40.1 40.2 40.7 43.7 44.4
--------------------------------------------------------------------------------
below18_grp 
       n  missing distinct 
     277       70        2 
                      
Value       High   Low
Frequency    140   137
Proportion 0.505 0.495
--------------------------------------------------------------------------------
below_18 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
     347        0      347        1    20.78    2.892    16.60    17.75 
     .25      .50      .75      .90      .95 
   19.19    20.81    22.52    23.78    24.58 

lowest : 9.70262 11.0034 13.0585 13.8239 14.3893
highest: 26.8273 27.2551 27.648  27.9188 31.7401
--------------------------------------------------------------------------------
unins_kids_2019 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
     347        0      347        1    4.362    1.689    2.121    2.510 
     .25      .50      .75      .90      .95 
   3.347    4.299    5.165    6.229    7.001 

lowest : 0.764677 0.920235 0.928634 0.930481 1.03025 
highest: 7.86019  8.04137  8.04232  11.9552  16.3218 
--------------------------------------------------------------------------------
unins_kids_2014 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
     347        0      347        1  0.06482  0.02224  0.03637  0.04170 
     .25      .50      .75      .90      .95 
 0.05210  0.06270  0.07580  0.09013  0.09896 

lowest : 0.0126525 0.0167432 0.0172625 0.0176735 0.0185266
highest: 0.119201  0.120919  0.131406  0.131839  0.136657 
--------------------------------------------------------------------------------
county_ranked 
       n  missing distinct     Info     Mean      Gmd 
     347        0        1        0        1        0 
              
Value        1
Frequency  347
Proportion   1
--------------------------------------------------------------------------------

11 The Codebook

Our chr_2019 tibble contains 347 counties and 11 variables.

| Variable | Original | Role | NA | Distinct | Definition |
|---|---|---|---|---|---|
| fipscode | | ID | 0 | 347 | county’s FIPS code |
| state | | ID | 0 | 6 | state postal abbreviation |
| county | | ID | 0 | 282 | county name |
| child_mort | v128 | A1 out | 40 | 307 | Child mortality (deaths among residents under age 18 per 100,000 population, 2010-2014) |
| free_lunch | v065 | A1 pre | 14 | 333 | % of children enrolled in public schools that are eligible for free or reduced price lunch |
| child_pov | v024 | A2 out | 0 | 195 | % of people under 18 in poverty |
| below18_grp | v052 | A2 pre | 70 | 2 | % of county residents below 18 years of age (Low is ≤ 20.3%, High is ≥ 21.5%) |
| below_18 | v052 | Var 4 | 0 | 347 | Quantitative version of % below 18 years of age |
| unins_kids_2019 | v122 | A3 (2019) | 0 | 347 | % of children under age 19 without health insurance from CHR 2019 |
| unins_kids_2014 | v122 | A3 (2014) | 0 | 347 | % of children under age 19 without health insurance from CHR 2014 |
| county_ranked | | Check | 0 | 1 | all values are 1 |

We should check here that none of our variables (other than the Analysis 2 predictor) has more than 20% missingness, and that we have at least 15 distinct values for all quantitative variables. You’ll want to affirm this in your proposal, with statements like:

  • [Missingness Check]: We have no quantitative variables missing more than 40 of our 347 counties (11.5%), which is less than Project A’s limit of 20%.
  • [Distinct Values Check]: We have at least 195 distinct values in each of our quantitative variables, which is much larger than the minimum count (15) for Project A.
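
These two checks can also be computed rather than read off the describe() output. The sketch below uses a small hypothetical tibble. The name toy, its values, and the checks tibble are illustrative assumptions.

```r
library(tibble)

# Hypothetical quantitative variables, one with missing values
toy <- tibble(
  child_mort = c(50.1, NA, 47.3, 61.0, NA),
  free_lunch = c(40.2, 55.1, 38.9, 62.4, 47.7)
)

# One row per variable: missing count, missing percentage, distinct values
checks <- tibble(
  variable = names(toy),
  n_missing = vapply(toy, function(x) sum(is.na(x)),
                     integer(1), USE.NAMES = FALSE),
  pct_missing = 100 * n_missing / nrow(toy),
  distinct_values = vapply(toy, function(x) length(unique(x[!is.na(x)])),
                           integer(1), USE.NAMES = FALSE)
)

checks

# The percentage quoted above: 40 missing out of 347 counties
pct_quoted <- round(100 * 40 / 347, 1)
pct_quoted
```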

12 Research Questions

12.1 Analysis 1 Research Question

Here is where you’ll place your research question for Analysis 1, which in my case involves predicting child_mort from free_lunch.

12.2 Analysis 2 Research Question

Here is where you’ll place your research question for Analysis 2, which in my case involves comparing means of child_pov across our two groups in below18_grp.

12.3 Analysis 3 Research Question

Here is where you’ll place your research question for Analysis 3, which in my case involves comparing the mean of unins_kids in the 2019 CHR report to its mean in the 2014 CHR report.

13 Analysis 1

Delete these instructions when submitting your work

Follow the instructions on the Analyses page carefully.

I’m leaving the Analysis sections to you in this Sample Report.

13.1 Variables

13.2 Summaries

13.3 Approach

13.4 Conclusions

14 Analysis 2

Delete these instructions when submitting your work

Follow the instructions on the Analyses page carefully.

I’m leaving the Analysis sections to you in this Sample Report.

14.1 Variables

14.2 Summaries

14.3 Approach

14.4 Conclusions

15 Analysis 3

Delete these instructions when submitting your work

Follow the instructions on the Analyses page carefully.

I’m leaving the Analysis sections to you in this Sample Report.

15.1 Variables

15.2 Summaries

15.3 Approach

15.4 Conclusions

16 Portfolio Reflections

Delete these instructions when submitting your work

The original “Proposal Reflections” section is only included as Section 13 in the Proposal, and should not be included in the final portfolio report. Instead, write a new paragraph (containing at least four well-constructed complete English sentences) to answer the following question:

What was the most important thing you learned as a result of doing this project, and why?

17 Session Information

session_info()
─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.3.1 (2023-06-16 ucrt)
 os       Windows 11 x64 (build 22621)
 system   x86_64, mingw32
 ui       RTerm
 language (EN)
 collate  English_United States.utf8
 ctype    English_United States.utf8
 tz       America/New_York
 date     2023-08-23
 pandoc   3.1.1 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)

─ Packages ───────────────────────────────────────────────────────────────────
 package     * version date (UTC) lib source
 backports     1.4.1   2021-12-13 [1] CRAN (R 4.3.0)
 base64enc     0.1-3   2015-07-28 [1] CRAN (R 4.3.0)
 bit           4.0.5   2022-11-15 [1] CRAN (R 4.3.1)
 bit64         4.0.5   2020-08-30 [1] CRAN (R 4.3.1)
 checkmate     2.2.0   2023-04-27 [1] CRAN (R 4.3.1)
 cli           3.6.1   2023-03-23 [1] CRAN (R 4.3.1)
 cluster       2.1.4   2022-08-22 [2] CRAN (R 4.3.1)
 colorspace    2.1-0   2023-01-23 [1] CRAN (R 4.3.1)
 crayon        1.5.2   2022-09-29 [1] CRAN (R 4.3.1)
 curl          5.0.1   2023-06-07 [1] CRAN (R 4.3.1)
 data.table    1.14.8  2023-02-17 [1] CRAN (R 4.3.1)
 digest        0.6.33  2023-07-07 [1] CRAN (R 4.3.1)
 dplyr       * 1.1.2   2023-04-20 [1] CRAN (R 4.3.1)
 evaluate      0.21    2023-05-05 [1] CRAN (R 4.3.1)
 fansi         1.0.4   2023-01-22 [1] CRAN (R 4.3.1)
 fastmap       1.1.1   2023-02-24 [1] CRAN (R 4.3.1)
 forcats     * 1.0.0   2023-01-29 [1] CRAN (R 4.3.1)
 foreign       0.8-84  2022-12-06 [2] CRAN (R 4.3.1)
 Formula       1.2-5   2023-02-24 [1] CRAN (R 4.3.0)
 generics      0.1.3   2022-07-05 [1] CRAN (R 4.3.1)
 ggplot2     * 3.4.2   2023-04-03 [1] CRAN (R 4.3.1)
 glue          1.6.2   2022-02-24 [1] CRAN (R 4.3.1)
 gridExtra     2.3     2017-09-09 [1] CRAN (R 4.3.1)
 gtable        0.3.3   2023-03-21 [1] CRAN (R 4.3.1)
 Hmisc       * 5.1-0   2023-05-08 [1] CRAN (R 4.3.1)
 hms           1.1.3   2023-03-21 [1] CRAN (R 4.3.1)
 htmlTable     2.4.1   2022-07-07 [1] CRAN (R 4.3.1)
 htmltools     0.5.5   2023-03-23 [1] CRAN (R 4.3.1)
 htmlwidgets   1.6.2   2023-03-17 [1] CRAN (R 4.3.1)
 janitor     * 2.2.0   2023-02-02 [1] CRAN (R 4.3.1)
 jsonlite      1.8.7   2023-06-29 [1] CRAN (R 4.3.1)
 knitr         1.43    2023-05-25 [1] CRAN (R 4.3.1)
 lifecycle     1.0.3   2022-10-07 [1] CRAN (R 4.3.1)
 lubridate   * 1.9.2   2023-02-10 [1] CRAN (R 4.3.1)
 magrittr      2.0.3   2022-03-30 [1] CRAN (R 4.3.1)
 munsell       0.5.0   2018-06-12 [1] CRAN (R 4.3.1)
 naniar      * 1.0.0   2023-02-02 [1] CRAN (R 4.3.1)
 nnet          7.3-19  2023-05-03 [2] CRAN (R 4.3.1)
 pillar        1.9.0   2023-03-22 [1] CRAN (R 4.3.1)
 pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.3.1)
 purrr       * 1.0.1   2023-01-10 [1] CRAN (R 4.3.1)
 R6            2.5.1   2021-08-19 [1] CRAN (R 4.3.1)
 readr       * 2.1.4   2023-02-10 [1] CRAN (R 4.3.1)
 rlang         1.1.1   2023-04-28 [1] CRAN (R 4.3.1)
 rmarkdown     2.23    2023-07-01 [1] CRAN (R 4.3.1)
 rpart         4.1.19  2022-10-21 [2] CRAN (R 4.3.1)
 rstudioapi    0.15.0  2023-07-07 [1] CRAN (R 4.3.1)
 scales        1.2.1   2022-08-20 [1] CRAN (R 4.3.1)
 sessioninfo * 1.2.2   2021-12-06 [1] CRAN (R 4.3.1)
 snakecase     0.11.0  2019-05-25 [1] CRAN (R 4.3.1)
 stringi       1.7.12  2023-01-11 [1] CRAN (R 4.3.0)
 stringr     * 1.5.0   2022-12-02 [1] CRAN (R 4.3.1)
 tibble      * 3.2.1   2023-03-20 [1] CRAN (R 4.3.1)
 tidyr       * 1.3.0   2023-01-24 [1] CRAN (R 4.3.1)
 tidyselect    1.2.0   2022-10-10 [1] CRAN (R 4.3.1)
 tidyverse   * 2.0.0   2023-02-22 [1] CRAN (R 4.3.1)
 timechange    0.2.0   2023-01-11 [1] CRAN (R 4.3.1)
 tzdb          0.4.0   2023-05-12 [1] CRAN (R 4.3.1)
 utf8          1.2.3   2023-01-31 [1] CRAN (R 4.3.1)
 vctrs         0.6.3   2023-06-14 [1] CRAN (R 4.3.1)
 visdat        0.6.0   2023-02-02 [1] CRAN (R 4.3.1)
 vroom         1.6.3   2023-04-28 [1] CRAN (R 4.3.1)
 withr         2.5.0   2022-03-03 [1] CRAN (R 4.3.1)
 xfun          0.39    2023-04-20 [1] CRAN (R 4.3.1)
 yaml          2.3.7   2023-01-23 [1] CRAN (R 4.3.0)

 [1] C:/Users/thoma/AppData/Local/R/win-library/4.3
 [2] C:/Program Files/R/R-4.3.1/library

──────────────────────────────────────────────────────────────────────────────

Footnotes

  1. We could have chosen to use “less than 20.3” and “higher than 21.5” as well, which would potentially have a small impact on our final groups.

  2. If you want to see this file, I would be happy to share it with you.