431 Sample Project A Portfolio Report
- A template for the Project A proposal is available to you, via the Examples page on the Project A website. Please use it in combination with this document to prepare your proposal. We used it to develop this document.
- My instructions and comments in this sample proposal should not appear in your final submitted Project A proposal. They’re just here to help guide you. You need to write your own comments and responses to the Proposal’s requirements.
- You need a real title (80 characters, maximum, without using “431” or “Project” or “Project A”) in your proposal. You can, as I have above, include a subtitle, but the main title must stand on its own.
1 R Packages
library(Hmisc)
library(janitor)
library(naniar)
library(sessioninfo)
library(tidyverse)
knitr::opts_chunk$set(comment = NA)
2 Data Ingest
I am ingesting data from the 2019 County Health Rankings, rather than the data you will use.
data_url <- "https://www.countyhealthrankings.org/sites/default/files/media/document/analytic_data2019.csv"
chr_2019_raw <- read_csv(data_url, skip = 1, guess_max = 4000,
                         show_col_types = FALSE)
Next, we filter these data to the rows which have county_ranked values of 1.
chr_2019_raw <- chr_2019_raw |>
  filter(county_ranked == 1)
The resulting chr_2019_raw tibble now has 3081 rows, and 534 columns.
Make sure you look at the Quarto file for this document, and note the use of inline coding to get R to tell me the number of rows and number of columns in the resulting chr_2019_raw tibble.
Another approach would have been to use the dim() function here.
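As a minimal sketch of both approaches, the following uses a small made-up tibble (`toy` is hypothetical, not the CHR data). `nrow()` and `ncol()` are the kind of calls one would typically place in inline code, while `dim()` returns both counts at once.

```r
library(tibble)

# Hypothetical stand-in for chr_2019_raw; the values are invented
toy <- tibble(
  fipscode = c("39035", "39049", "39061"),
  county_ranked = c(1, 1, 1)
)

# Pieces suitable for Quarto inline code, such as `r nrow(toy)` and `r ncol(toy)`
n_rows <- nrow(toy)
n_cols <- ncol(toy)

# dim() returns rows and columns together as a length-2 integer vector
dims <- dim(toy)
dims
```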
3 State Selection
In selecting the six states for this sample proposal, I’m using some states you’re not permitted to use. Specifically, I’ve arbitrarily decided to use New York, Ohio, Massachusetts, Pennsylvania, Maine and North Carolina.
Here, I’ll select my six states, then change the state to a factor variable.
chr_2019 <- chr_2019_raw |>
  filter(state %in% c("NY", "OH", "MA", "PA", "ME", "NC")) |>
  mutate(state = factor(state))
Next, I’ll look to see how many counties are in each state.
chr_2019 |> count(state)
# A tibble: 6 × 2
state n
<fct> <int>
1 MA 14
2 ME 16
3 NC 100
4 NY 62
5 OH 88
6 PA 67
We have selected 6 states, yielding a total of 347 ranked counties, which is between 300 and 800 so we’re all set.
Again, in this last sentence, I’ve used inline coding to get R to tell me the number of states and the number of rows in the resulting chr_2019 tibble.
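One way the size requirement could be checked in code is sketched below, assuming a hypothetical four-row tibble named `toy` in place of the real 347-county data. The 300 and 800 bounds come from the sentence above; everything else is invented for illustration.

```r
library(dplyr)
library(tibble)

# Hypothetical miniature of the filtered tibble; the real data have 347 rows
toy <- tibble(
  state = factor(c("OH", "OH", "NY", "ME")),
  county = c("Cuyahoga County", "Franklin County",
             "Albany County", "Knox County")
)

n_states <- n_distinct(toy$state)
n_counties <- nrow(toy)

# between() is inclusive at both ends; this toy example has only
# 4 counties, so the check is FALSE here
size_ok <- between(n_counties, 300, 800)
size_ok
```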
Here is the place to put a brief description as to why you selected the states that you selected. I will leave that work to you.
4 Variable Selection
I will select some variables for this example which are not available to you.
I’ve decided to select variables v128, v065, v024, v052 and v122.
chr_2019 <- chr_2019 |>
  select(fipscode, state, county, county_ranked,
         v128_rawvalue, v065_rawvalue, v024_rawvalue, v052_rawvalue, v122_rawvalue)
I now have a chr_2019 tibble with exactly 9 columns, as required.
5 Variable Cleaning and Renaming
The variables I’m using describe the following measures:
- Use this link for the CHR 2023 version of this information.
- Your version of this material should include the year(s) in which this information was obtained. I’ve left that out here.
Initial Name | New Name | Role | Description |
---|---|---|---|
v128_rawvalue | child_mort | A1 outcome | Child mortality (deaths among residents under age 18 per 100,000 population) |
v065_rawvalue | free_lunch | A1 predictor | % of children enrolled in public schools that are eligible for free or reduced price lunch |
v024_rawvalue | child_pov | A2 outcome | % of people under 18 in poverty |
v052_rawvalue | below_18 | A2 predictor | % of county residents below 18 years of age |
v122_rawvalue | unins_kids | Extra | % of children under age 19 without health insurance |
- v065, v024, v052 and v122 are all proportions, which need to be multiplied by 100
- v128 is OK as is
Here, I’ll multiply the four variables that describe proportions by 100 to obtain percentages instead, to ease interpretation.
chr_2019 <- chr_2019 |>
  mutate(free_lunch = 100*v065_rawvalue,
         child_pov = 100*v024_rawvalue,
         below_18 = 100*v052_rawvalue,
         unins_kids = 100*v122_rawvalue,
         .keep = "unused") |>
  rename(child_mort = v128_rawvalue)
dim(chr_2019)
[1] 347 9
names(chr_2019)
[1] "fipscode" "state" "county" "county_ranked"
[5] "child_mort" "free_lunch" "child_pov" "below_18"
[9] "unins_kids"
What does this indicate to you about the use of .keep = "unused" in the mutate() function?
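To explore that question, here is a small sketch comparing the default behavior of mutate() with `.keep = "unused"`. The tibble `toy` and its values are hypothetical; only the column naming pattern is borrowed from the code above.

```r
library(dplyr)
library(tibble)

# Hypothetical two-county example; values are invented
toy <- tibble(
  county = c("A", "B"),
  v065_rawvalue = c(0.25, 0.50),
  v128_rawvalue = c(40, 60)
)

# Default (.keep = "all"): the source column stays alongside the new one
kept_all <- toy |>
  mutate(free_lunch = 100 * v065_rawvalue)

# .keep = "unused": columns used to build new ones are dropped,
# while columns not mentioned in mutate() are retained
kept_unused <- toy |>
  mutate(free_lunch = 100 * v065_rawvalue, .keep = "unused")

names(kept_all)
names(kept_unused)
```

This is consistent with the nine-column result shown above: each raw proportion is replaced by its percentage version, and v128_rawvalue survives because mutate() never used it.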
6 Creating the Analysis 2 Predictor
To establish our cutpoints, we should look at the 40th and 60th percentiles of the existing data for our planned predictor for Analysis 2, which is below_18.
chr_2019 |>
  summarise(q40 = quantile(below_18, c(0.4)),
            q60 = quantile(below_18, c(0.6)))
# A tibble: 1 × 2
q40 q60
<dbl> <dbl>
1 20.3 21.5
So we will create a three-level variable where values of 20.3 and lower will fall in the “Low” group, values of 21.5 and higher will fall in the “High” group, and values in between will be left as missing.
chr_2019 <- chr_2019 |>
  mutate(below18_grp = case_when(
    below_18 <= 20.3 ~ "Low",
    below_18 >= 21.5 ~ "High")) |>
  mutate(below18_grp = factor(below18_grp))
chr_2019 |> count(below18_grp)
# A tibble: 3 × 2
below18_grp n
<fct> <int>
1 High 140
2 Low 137
3 <NA> 70
It appears that we have a little over 40% of our subjects in the High group and a little under 40% in the Low group, with the rest now listed as missing, and the below18_grp variable is now a factor, so that’s fine.
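The missing middle group arises because case_when() returns NA for any row that matches none of its conditions. A minimal sketch with five invented values of below_18 (the tibble `toy` and the column name `grp` are hypothetical) shows the behavior at and between the two cutpoints:

```r
library(dplyr)
library(tibble)

# Invented values: two at or below 20.3, one in between, two at or above 21.5
toy <- tibble(below_18 = c(18.0, 20.3, 20.9, 21.5, 24.0))

toy <- toy |>
  mutate(grp = case_when(
    below_18 <= 20.3 ~ "Low",
    below_18 >= 21.5 ~ "High"),
    grp = factor(grp))

# useNA = "ifany" makes the unmatched middle value visible in the tally
table(toy$grp, useNA = "ifany")
```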
7 Adding 2018 Data for the Analysis 3 Outcome
In my case, I’ll add data from 2014, since that’s five years prior to the 2019 report.
To do so, I created a file, called chr_2014_raw.csv, that contains two variables: the FIPS code and the values of v122_rawvalue for each of the 3,048 counties ranked in 2014.
chr_2014_raw <- read_csv("chr_2014_raw.csv",
                         guess_max = 4000,
                         show_col_types = FALSE)
chr_2014 <- chr_2014_raw |> mutate(fipscode = as.character(fipscode))
Now, I’ll join the two files.
chr_2019 <- left_join(chr_2019, chr_2014, by = c("fipscode"))
We need to rename the two variables which deal with our Analysis 3 outcome.
chr_2019 <- chr_2019 |>
  rename(unins_kids_2019 = unins_kids,
         unins_kids_2014 = v122_rawvalue)
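Two checks may be worth running around a join like this. The describe() output in the Numerical Summaries section reports a mean of 0.06482 for unins_kids_2014 against 4.362 for unins_kids_2019, which suggests the 2014 values may still be proportions rather than percentages. The sketch below uses hypothetical miniature tibbles (`toy_2019`, `toy_2014`, invented values) to show how unmatched counties could be listed with anti_join(), and how the 2014 column might be rescaled. The rescaling step is an assumption on my part, not part of the code above.

```r
library(dplyr)
library(tibble)

# Hypothetical miniatures of the two files; all values are invented
toy_2019 <- tibble(fipscode = c("23001", "23003", "23005"),
                   unins_kids = c(4.1, 5.6, 3.2))        # already percentages
toy_2014 <- tibble(fipscode = c("23001", "23003"),
                   v122_rawvalue = c(0.062, 0.075))       # still proportions

# Rows of the 2019 data with no 2014 match; left_join() would give these NA
unmatched <- anti_join(toy_2019, toy_2014, by = "fipscode")

joined <- left_join(toy_2019, toy_2014, by = "fipscode") |>
  rename(unins_kids_2019 = unins_kids) |>
  # Assumption: rescale the 2014 proportion so both years are percentages
  mutate(unins_kids_2014 = 100 * v122_rawvalue, .keep = "unused")

unmatched
joined
```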
8 Arranging and Saving the Analytic Tibble
Now we arrange the variables in the specified order from Data Task 5, and then save the new result to a new .Rds file called chr_2019_YOURNAME.Rds.
chr_2019 <- chr_2019 |>
  select(fipscode, state, county,
         child_mort, free_lunch, ## Analysis 1 variables
         child_pov, below18_grp, ## Analysis 2 variables
         below_18, ## Quantitative version of group
         unins_kids_2019, unins_kids_2014, ## Analysis 3 variables
         county_ranked)
write_rds(chr_2019, file = "chr_2019_YOURNAME.Rds")
We will make no changes to the chr_2019 tibble after this point in the Proposal.
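As a quick sanity check on the saved file, one could read it back with read_rds() and confirm that column types survive the round trip. This sketch uses a hypothetical two-row tibble and a temporary file path instead of the real chr_2019_YOURNAME.Rds.

```r
library(readr)
library(tibble)

# Hypothetical two-row tibble with a factor column
toy <- tibble(fipscode = c("23001", "23003"),
              state = factor(c("ME", "ME")))

# tempfile() keeps this sketch from writing into the project directory
path <- tempfile(fileext = ".Rds")
write_rds(toy, file = path)

# Unlike a CSV file, the .Rds format preserves column types such as factors
toy_back <- read_rds(path)
is.factor(toy_back$state)
```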
9 Print the Tibble
We print the tibble, to prove it is one. If it is a tibble, only the first 10 rows will print, like this:
chr_2019
# A tibble: 347 × 11
fipscode state county child_mort free_lunch child_pov below18_grp below_18
<chr> <fct> <chr> <dbl> <dbl> <dbl> <fct> <dbl>
1 23001 ME Androsco… 51.1 56.5 15.8 High 21.8
2 23003 ME Aroostoo… 76.9 54.1 17.5 Low 18.3
3 23005 ME Cumberla… 40.7 32.9 9.2 Low 18.9
4 23007 ME Franklin… 63.9 51.9 19.5 Low 17.7
5 23009 ME Hancock … 31.5 42.8 15 Low 17.3
6 23011 ME Kennebec… 49.4 49.2 14.2 Low 19.5
7 23013 ME Knox Cou… 48.3 42.0 14.7 Low 17.9
8 23015 ME Lincoln … 59.8 47.1 15.3 Low 16.6
9 23017 ME Oxford C… 31.9 61.0 20.2 Low 18.7
10 23019 ME Penobsco… 58.6 47.2 16.2 Low 18.3
# ℹ 337 more rows
# ℹ 3 more variables: unins_kids_2019 <dbl>, unins_kids_2014 <dbl>,
# county_ranked <dbl>
We can see that each of the variables is of the appropriate type:
- character for fipscode and county
- double for all quantitative measures
- factor for state and below18_grp
so, again, we’re OK.
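The same checks can be made programmatically. Below is a minimal sketch on a hypothetical three-column tibble (`toy`, invented values). Note that base R reports double columns as class "numeric", which the tibble printout abbreviates as `<dbl>`.

```r
library(tibble)

# Hypothetical stand-in with one column of each type discussed above
toy <- tibble(
  fipscode = c("23001", "23003"),
  state = factor(c("ME", "ME")),
  child_pov = c(15.8, 17.5)
)

# TRUE when the object is a tibble and not a plain data frame
is_tibble(toy)

# First class of each column, returned as a named character vector
col_classes <- vapply(toy, function(col) class(col)[1], character(1))
col_classes
```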
10 Numerical Summaries
describe(chr_2019)
chr_2019
11 Variables 347 Observations
--------------------------------------------------------------------------------
fipscode
n missing distinct
347 0 347
lowest : 23001 23003 23005 23007 23009, highest: 42125 42127 42129 42131 42133
--------------------------------------------------------------------------------
state
n missing distinct
347 0 6
Value MA ME NC NY OH PA
Frequency 14 16 100 62 88 67
Proportion 0.040 0.046 0.288 0.179 0.254 0.193
--------------------------------------------------------------------------------
county
n missing distinct
347 0 282
lowest : Adams County Alamance County Albany County Alexander County Allegany County
highest: Wyoming County Yadkin County Yancey County Yates County York County
--------------------------------------------------------------------------------
child_mort
n missing distinct Info Mean Gmd .05 .10
307 40 307 1 53.32 17.58 29.67 34.20
.25 .50 .75 .90 .95
42.36 51.59 61.07 73.04 85.87
lowest : 21.9066 24.7998 26.5791 26.9921 27.0056
highest: 98.2766 98.5884 103.693 107.095 116.575
--------------------------------------------------------------------------------
free_lunch
n missing distinct Info Mean Gmd .05 .10
333 14 333 1 52.88 17.83 30.68 35.33
.25 .50 .75 .90 .95
42.32 51.01 59.62 73.08 92.42
lowest : 14.7299 17.3719 19.2631 20.0969 20.7698
highest: 97.9279 98.2072 98.3639 98.4087 98.7017
--------------------------------------------------------------------------------
child_pov
n missing distinct Info Mean Gmd .05 .10
347 0 195 1 20.36 8.176 9.29 11.56
.25 .50 .75 .90 .95
15.10 20.10 24.60 29.08 33.00
lowest : 5.3 5.6 5.9 6.1 6.5 , highest: 40.1 40.2 40.7 43.7 44.4
--------------------------------------------------------------------------------
below18_grp
n missing distinct
277 70 2
Value High Low
Frequency 140 137
Proportion 0.505 0.495
--------------------------------------------------------------------------------
below_18
n missing distinct Info Mean Gmd .05 .10
347 0 347 1 20.78 2.892 16.60 17.75
.25 .50 .75 .90 .95
19.19 20.81 22.52 23.78 24.58
lowest : 9.70262 11.0034 13.0585 13.8239 14.3893
highest: 26.8273 27.2551 27.648 27.9188 31.7401
--------------------------------------------------------------------------------
unins_kids_2019
n missing distinct Info Mean Gmd .05 .10
347 0 347 1 4.362 1.689 2.121 2.510
.25 .50 .75 .90 .95
3.347 4.299 5.165 6.229 7.001
lowest : 0.764677 0.920235 0.928634 0.930481 1.03025
highest: 7.86019 8.04137 8.04232 11.9552 16.3218
--------------------------------------------------------------------------------
unins_kids_2014
n missing distinct Info Mean Gmd .05 .10
347 0 347 1 0.06482 0.02224 0.03637 0.04170
.25 .50 .75 .90 .95
0.05210 0.06270 0.07580 0.09013 0.09896
lowest : 0.0126525 0.0167432 0.0172625 0.0176735 0.0185266
highest: 0.119201 0.120919 0.131406 0.131839 0.136657
--------------------------------------------------------------------------------
county_ranked
n missing distinct Info Mean Gmd
347 0 1 0 1 0
Value 1
Frequency 347
Proportion 1
--------------------------------------------------------------------------------
11 The Codebook
Our chr_2019 tibble contains 347 counties and 11 variables.
Variable | Original | Role | NA | Distinct | Definition |
---|---|---|---|---|---|
fipscode | – | ID | 0 | 347 | county’s FIPS code |
state | – | ID | 0 | 6 | state postal abbreviation |
county | – | ID | 0 | 282 | county name |
child_mort | v128 | A1 out | 40 | 307 | Child mortality (deaths among residents under age 18 per 100,000 population, 2010-2014) |
free_lunch | v065 | A1 pre | 14 | 333 | % of children enrolled in public schools that are eligible for free or reduced price lunch |
child_pov | v024 | A2 out | 0 | 195 | % of people under 18 in poverty |
below18_grp | v052 | A2 pre | 70 | 2 | % of county residents below 18 years of age (Low is \(\leq\) 20.3%, High is \(\geq\) 21.5%) |
below_18 | v052 | Var 4 | 0 | 347 | Quantitative version of % below 18 years of age |
unins_kids_2019 | v122 | A3 (2019) | 0 | 347 | % of children under age 19 without health insurance from CHR 2019 |
unins_kids_2014 | v122 | A3 (2014) | 0 | 347 | % of children under age 19 without health insurance from CHR 2014 |
county_ranked | – | Check | 0 | 1 | all values are 1 |
We should check here that no variable (other than the Analysis 2 predictor) has more than 20% missingness, and that we have at least 15 distinct values for all quantitative variables. You’ll want to affirm this in your proposal, with statements like:
- [Missingness Check]: We have no quantitative variables missing more than 40 of our 347 counties (11.5%), which is less than Project A’s limit of 20%.
- [Distinct Values Check]: We have at least 195 distinct values in each of our quantitative variables, which is much larger than the minimum count (15) for Project A.
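Both checks can also be computed directly instead of being read off the describe() output. The sketch below assumes a hypothetical 20-row tibble `toy` with invented values. The 20% and 15-value thresholds are the Project A limits stated above.

```r
library(dplyr)
library(tibble)

# Hypothetical data: 20 rows, with 2 missing values in x
toy <- tibble(
  x = c(NA, NA, seq(10, 27, by = 1)),  # 18 observed values
  y = seq(0.5, 10, by = 0.5)           # 20 observed values
)

# Percentage missing in each numeric column (compare to the 20% limit)
pct_missing <- toy |>
  summarise(across(where(is.numeric), ~ 100 * mean(is.na(.x))))

# Distinct non-missing values in each numeric column (compare to 15)
n_unique <- toy |>
  summarise(across(where(is.numeric), ~ n_distinct(.x, na.rm = TRUE)))

pct_missing
n_unique
```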
12 Research Questions
12.1 Analysis 1 Research Question
Here is where you’ll place your research question for Analysis 1, which in my case involves predicting child_mort from free_lunch.
12.2 Analysis 2 Research Question
Here is where you’ll place your research question for Analysis 2, which in my case involves comparing means of child_pov across our two groups in below18_grp.
12.3 Analysis 3 Research Question
Here is where you’ll place your research question for Analysis 3, which in my case involves comparing means of unins_kids in the 2019 report as compared to the 2014 report by CHR.
13 Analysis 1
Follow the instructions on the Analyses page carefully.
I’m leaving the Analysis sections to you in this Sample Report.
13.1 Variables
13.2 Summaries
13.3 Approach
13.4 Conclusions
14 Analysis 2
Follow the instructions on the Analyses page carefully.
I’m leaving the Analysis sections to you in this Sample Report.
14.1 Variables
14.2 Summaries
14.3 Approach
14.4 Conclusions
15 Analysis 3
Follow the instructions on the Analyses page carefully.
I’m leaving the Analysis sections to you in this Sample Report.
15.1 Variables
15.2 Summaries
15.3 Approach
15.4 Conclusions
16 Portfolio Reflections
The original “Proposal Reflections” section is only included as Section 13 in the Proposal, and should not be included in the final portfolio report. Instead, write a new paragraph (containing at least four well-constructed complete English sentences) to answer the following question:
What was the most important thing you learned as a result of doing this project, and why?
17 Session Information
session_info()
─ Session info ───────────────────────────────────────────────────────────────
setting value
version R version 4.3.1 (2023-06-16 ucrt)
os Windows 11 x64 (build 22621)
system x86_64, mingw32
ui RTerm
language (EN)
collate English_United States.utf8
ctype English_United States.utf8
tz America/New_York
date 2023-08-23
pandoc 3.1.1 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
─ Packages ───────────────────────────────────────────────────────────────────
package * version date (UTC) lib source
backports 1.4.1 2021-12-13 [1] CRAN (R 4.3.0)
base64enc 0.1-3 2015-07-28 [1] CRAN (R 4.3.0)
bit 4.0.5 2022-11-15 [1] CRAN (R 4.3.1)
bit64 4.0.5 2020-08-30 [1] CRAN (R 4.3.1)
checkmate 2.2.0 2023-04-27 [1] CRAN (R 4.3.1)
cli 3.6.1 2023-03-23 [1] CRAN (R 4.3.1)
cluster 2.1.4 2022-08-22 [2] CRAN (R 4.3.1)
colorspace 2.1-0 2023-01-23 [1] CRAN (R 4.3.1)
crayon 1.5.2 2022-09-29 [1] CRAN (R 4.3.1)
curl 5.0.1 2023-06-07 [1] CRAN (R 4.3.1)
data.table 1.14.8 2023-02-17 [1] CRAN (R 4.3.1)
digest 0.6.33 2023-07-07 [1] CRAN (R 4.3.1)
dplyr * 1.1.2 2023-04-20 [1] CRAN (R 4.3.1)
evaluate 0.21 2023-05-05 [1] CRAN (R 4.3.1)
fansi 1.0.4 2023-01-22 [1] CRAN (R 4.3.1)
fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.3.1)
forcats * 1.0.0 2023-01-29 [1] CRAN (R 4.3.1)
foreign 0.8-84 2022-12-06 [2] CRAN (R 4.3.1)
Formula 1.2-5 2023-02-24 [1] CRAN (R 4.3.0)
generics 0.1.3 2022-07-05 [1] CRAN (R 4.3.1)
ggplot2 * 3.4.2 2023-04-03 [1] CRAN (R 4.3.1)
glue 1.6.2 2022-02-24 [1] CRAN (R 4.3.1)
gridExtra 2.3 2017-09-09 [1] CRAN (R 4.3.1)
gtable 0.3.3 2023-03-21 [1] CRAN (R 4.3.1)
Hmisc * 5.1-0 2023-05-08 [1] CRAN (R 4.3.1)
hms 1.1.3 2023-03-21 [1] CRAN (R 4.3.1)
htmlTable 2.4.1 2022-07-07 [1] CRAN (R 4.3.1)
htmltools 0.5.5 2023-03-23 [1] CRAN (R 4.3.1)
htmlwidgets 1.6.2 2023-03-17 [1] CRAN (R 4.3.1)
janitor * 2.2.0 2023-02-02 [1] CRAN (R 4.3.1)
jsonlite 1.8.7 2023-06-29 [1] CRAN (R 4.3.1)
knitr 1.43 2023-05-25 [1] CRAN (R 4.3.1)
lifecycle 1.0.3 2022-10-07 [1] CRAN (R 4.3.1)
lubridate * 1.9.2 2023-02-10 [1] CRAN (R 4.3.1)
magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.3.1)
munsell 0.5.0 2018-06-12 [1] CRAN (R 4.3.1)
naniar * 1.0.0 2023-02-02 [1] CRAN (R 4.3.1)
nnet 7.3-19 2023-05-03 [2] CRAN (R 4.3.1)
pillar 1.9.0 2023-03-22 [1] CRAN (R 4.3.1)
pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.3.1)
purrr * 1.0.1 2023-01-10 [1] CRAN (R 4.3.1)
R6 2.5.1 2021-08-19 [1] CRAN (R 4.3.1)
readr * 2.1.4 2023-02-10 [1] CRAN (R 4.3.1)
rlang 1.1.1 2023-04-28 [1] CRAN (R 4.3.1)
rmarkdown 2.23 2023-07-01 [1] CRAN (R 4.3.1)
rpart 4.1.19 2022-10-21 [2] CRAN (R 4.3.1)
rstudioapi 0.15.0 2023-07-07 [1] CRAN (R 4.3.1)
scales 1.2.1 2022-08-20 [1] CRAN (R 4.3.1)
sessioninfo * 1.2.2 2021-12-06 [1] CRAN (R 4.3.1)
snakecase 0.11.0 2019-05-25 [1] CRAN (R 4.3.1)
stringi 1.7.12 2023-01-11 [1] CRAN (R 4.3.0)
stringr * 1.5.0 2022-12-02 [1] CRAN (R 4.3.1)
tibble * 3.2.1 2023-03-20 [1] CRAN (R 4.3.1)
tidyr * 1.3.0 2023-01-24 [1] CRAN (R 4.3.1)
tidyselect 1.2.0 2022-10-10 [1] CRAN (R 4.3.1)
tidyverse * 2.0.0 2023-02-22 [1] CRAN (R 4.3.1)
timechange 0.2.0 2023-01-11 [1] CRAN (R 4.3.1)
tzdb 0.4.0 2023-05-12 [1] CRAN (R 4.3.1)
utf8 1.2.3 2023-01-31 [1] CRAN (R 4.3.1)
vctrs 0.6.3 2023-06-14 [1] CRAN (R 4.3.1)
visdat 0.6.0 2023-02-02 [1] CRAN (R 4.3.1)
vroom 1.6.3 2023-04-28 [1] CRAN (R 4.3.1)
withr 2.5.0 2022-03-03 [1] CRAN (R 4.3.1)
xfun 0.39 2023-04-20 [1] CRAN (R 4.3.1)
yaml 2.3.7 2023-01-23 [1] CRAN (R 4.3.0)
[1] C:/Users/thoma/AppData/Local/R/win-library/4.3
[2] C:/Program Files/R/R-4.3.1/library
──────────────────────────────────────────────────────────────────────────────