1 Preliminaries
1.1 My R Packages
This code snippet loads the packages that are going to be used throughout the proposal.
library(janitor)
library(naniar)
library(tidyverse)
library(mosaic)
library(readr)
library(utils)
library(ggplot2)
library(patchwork)
1.2 Data Ingest
The following code snippet loads the 2022 County Health Rankings (CHR) data.
data_url <-
"https://www.countyhealthrankings.org/sites/default/files/media/document/analytic_data2022.csv"
chr_2022_raw <- read_csv(data_url, skip = 1, guess_max = 4000,
show_col_types = FALSE)
2 Data Development
2.1 Selecting My Data
For this proposal, I will select six states: Ohio (OH), New York (NY), Pennsylvania (PA), Virginia (VA), Massachusetts (MA), and Maryland (MD). All of these are Eastern Seaboard states except for Ohio, which is in the Midwest. I have selected five variables, freq_mental_distress (v145_rawvalue), unemployment (v023_rawvalue), food_insecurity (v139_rawvalue), suicides (v161_rawvalue), and alcohol_impaired_driving_deaths (v134_rawvalue), since I am interested in how unemployment can affect mental health as well as other aspects of life.
From that, the chunk of code below completes the following tasks:
- Task 1: Filter the data to the actual counties that are ranked in the Rankings (this eliminates state and USA totals, mainly). This is done by keeping only rows with county_ranked equal to 1.
- Task 2: Filter to the 6 states I have selected (the %in% operator lets us include any state in the list we create with the c() function): OH, NY, PA, VA, MA, MD.
- Task 3: Select the variables I have chosen to use in this study, including the 3 mandatory variables (fipscode, state, and county): v145, v023, v139, v161, v134.
- Rename the 5 variables to meaningful names that describe what each variable measures.
- Except for the suicides variable, the other four variables are stored as proportions, so I will rescale them to percentages.
- Repair fipscode by converting it to a character variable instead of a numeric one, and convert state to a factor.
chr_2022 <- chr_2022_raw |>
filter(county_ranked == 1) |>
filter(state %in% c("OH", "NY", "PA", "VA", "MA", "MD")) |>
select(fipscode, state, county, county_ranked,
v145_rawvalue, v023_rawvalue, v139_rawvalue, v161_rawvalue, v134_rawvalue) |>
rename(freq_mental_distress = v145_rawvalue,
unemployment = v023_rawvalue,
food_insecurity = v139_rawvalue,
suicides = v161_rawvalue,
alcohol_impaired_driving_deaths = v134_rawvalue) |>
mutate(freq_mental_distress = 100*freq_mental_distress ,
unemployment = 100*unemployment,
food_insecurity = 100*food_insecurity,
alcohol_impaired_driving_deaths = 100*alcohol_impaired_driving_deaths) |>
mutate(fipscode = str_pad(fipscode, 5, pad="0"),
state = factor(state))
2.1.1 Checking Initial Work
After completing the tasks above, we want to check our initial work. This also checks whether the total number of counties from the selected states falls within the expected 200 - 400 range.
glimpse(chr_2022)
Rows: 388
Columns: 9
$ fipscode <chr> "24001", "24003", "24005", "24009", "2…
$ state <fct> MD, MD, MD, MD, MD, MD, MD, MD, MD, MD…
$ county <chr> "Allegany County", "Anne Arundel Count…
$ county_ranked <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ freq_mental_distress <dbl> 16.5, 12.2, 13.8, 13.1, 16.3, 13.1, 14…
$ unemployment <dbl> 7.776877, 5.832230, 6.827988, 5.226294…
$ food_insecurity <dbl> 16.9, 9.0, 10.7, 8.2, 13.3, 9.6, 12.4,…
$ suicides <dbl> 16.225499, 12.317121, 10.585320, 10.39…
$ alcohol_impaired_driving_deaths <dbl> 48.48485, 29.18455, 26.78571, 29.54545…
Since the number of rows is 388, the selection is acceptable, as it falls within the required range.
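As an explicit sanity check (a small sketch), we can confirm the row count falls in that range directly:
# Sketch: TRUE if the number of ranked counties is between 200 and 400
dplyr::between(nrow(chr_2022), 200, 400)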
2.1.2 Checking missing values
(Part of Task 6) We will check the missing values for each variable.
chr_2022 |>
miss_var_summary() |>
kable()
| variable | n_miss | pct_miss |
|---|---|---|
| suicides | 24 | 6.185567 |
| alcohol_impaired_driving_deaths | 6 | 1.546392 |
| fipscode | 0 | 0.000000 |
| state | 0 | 0.000000 |
| county | 0 | 0.000000 |
| county_ranked | 0 | 0.000000 |
| freq_mental_distress | 0 | 0.000000 |
| unemployment | 0 | 0.000000 |
| food_insecurity | 0 | 0.000000 |
2.1.3 Checking the percentage of available values by state for variables with missing data
(Part of Task 7) From the code snippet above, suicides and alcohol_impaired_driving_deaths have missing values. I want to summarize this by state in order to compare the results.
- For the suicides variable:
mosaic::favstats(suicides ~ state, data = chr_2022) |>
select(state, n, missing) |>
mutate(pct_available = 100*(n - missing)/n) |>
kable()
| state | n | missing | pct_available |
|---|---|---|---|
| MA | 13 | 1 | 92.30769 |
| MD | 24 | 0 | 100.00000 |
| NY | 61 | 1 | 98.36066 |
| OH | 87 | 1 | 98.85057 |
| PA | 64 | 3 | 95.31250 |
| VA | 115 | 18 | 84.34783 |
Since pct_available for each state is higher than 75% for suicides, we can move on to the next check.
- For the alcohol_impaired_driving_deaths variable:
mosaic::favstats(alcohol_impaired_driving_deaths ~ state, data = chr_2022) |>
select(state, n, missing) |>
mutate(pct_available = 100*(n - missing)/n) |>
kable()
| state | n | missing | pct_available |
|---|---|---|---|
| MA | 14 | 0 | 100.00000 |
| MD | 24 | 0 | 100.00000 |
| NY | 62 | 0 | 100.00000 |
| OH | 88 | 0 | 100.00000 |
| PA | 67 | 0 | 100.00000 |
| VA | 127 | 6 | 95.27559 |
Since pct_available for each state is higher than 75% for alcohol_impaired_driving_deaths, we can move on to the next check.
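As a compact alternative (a sketch), the percentage of non-missing values for both variables can be computed by state in one step. Note that this computes availability out of all counties in each state, so the exact percentages differ slightly from the favstats-based calculation above, but every value remains above 75%.
# Sketch: percent of non-missing values by state for the two variables with missing data
chr_2022 |>
  group_by(state) |>
  summarize(across(c(suicides, alcohol_impaired_driving_deaths),
                   ~ 100 * mean(!is.na(.x)),
                   .names = "pct_avail_{.col}")) |>
  kable(digits = 1)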
2.1.4 Checking numbers of distinct values for each variable
(Part of Task 7) I would like to check if each chosen variable contains at least 10 distinct non-missing values.
chr_2022 |>
summarize(across(freq_mental_distress:alcohol_impaired_driving_deaths, ~ n_distinct(.))) |>
kable()
| freq_mental_distress | unemployment | food_insecurity | suicides | alcohol_impaired_driving_deaths |
|---|---|---|---|---|
| 90 | 388 | 125 | 365 | 230 |
Since all conditions are satisfied, I will keep working on my data selection.
2.2 Create Factors For Two of the Five Variables
For Task 4, I will pick two out of five chosen
variables in order to create factors for these variables. I chose
unemployment and freq_mental_distress to
create factors.
2.2.1 Creating Binary Categorical Variables
- For freq_mental_distress, the new variable will be named mental_distress_level and used as a binary categorical variable:
  - low: frequency is below 13.2.
  - high: frequency is equal to or higher than 13.2.
The cut-off (13.2) was chosen based on the U.S. value for frequent mental distress (https://www.americashealthrankings.org/explore/annual/measure/mental_distress).
chr_2022 <- chr_2022 |>
mutate(mental_distress_level = case_when(
freq_mental_distress < 13.2 ~ "low",
TRUE ~"high"),
mental_distress_level = factor(mental_distress_level))
After creating mental_distress_level, I would like to check that it has been converted to a factor, and that each level of the factor contains at least 10 rows.
levels(as.factor(chr_2022$mental_distress_level))
[1] "high" "low"
mosaic::favstats(freq_mental_distress ~ mental_distress_level, data = chr_2022) |> kable(digits = 3)
| mental_distress_level | min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|---|
| high | 13.2 | 15.100 | 16.1 | 17.2 | 20.8 | 16.232 | 1.668 | 346 | 0 |
| low | 9.7 | 11.825 | 12.3 | 13.0 | 13.1 | 12.119 | 0.996 | 42 | 0 |
2.2.2 Creating a Three-Category Variable
- For unemployment, the new variable will be named unemployment_level and converted to a factor with three levels:
  - low: percentage is under 5%.
  - normal: percentage is at least 5% but lower than 7%.
  - high: percentage is equal to or higher than 7%.
The normal unemployment rate is usually said to be between 4% and 6%. Therefore, I treat any unemployment percentage under 5% (the midpoint of that range) as a low unemployment rate, percentages from 5% up to but not including 7% as a normal rate, and anything at or above 7% as a high rate.
chr_2022 <- chr_2022 |>
mutate(unemployment_level = case_when(
unemployment < 5 ~ "low",
unemployment < 7 ~ "normal",
TRUE ~ "high"),
unemployment_level = factor(unemployment_level))
I will do the same checks as above: confirm that unemployment_level is a factor and that it contains at least 10 rows at each level.
levels(as.factor(chr_2022$unemployment_level))
[1] "high" "low" "normal"
mosaic::favstats(unemployment ~ unemployment_level, data = chr_2022) |> kable(digits = 3)
| unemployment_level | min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|---|
| high | 7.004 | 7.790 | 8.487 | 9.405 | 16.031 | 8.743 | 1.309 | 239 | 0 |
| low | 3.713 | 4.479 | 4.621 | 4.835 | 4.954 | 4.542 | 0.385 | 24 | 0 |
| normal | 5.074 | 5.614 | 6.129 | 6.544 | 6.984 | 6.084 | 0.579 | 125 | 0 |
2.2.3 Checking factors of
state variable
(Part of Task 7) Finally, I would like to check that the state variable is a factor and that each level (state) contains at least 10 counties.
levels(as.factor(chr_2022$state))
[1] "MA" "MD" "NY" "OH" "PA" "VA"
chr_2022 |> tabyl(state) |> adorn_pct_formatting() |> kable()
| state | n | percent |
|---|---|---|
| MA | 14 | 3.6% |
| MD | 24 | 6.2% |
| NY | 62 | 16.0% |
| OH | 88 | 22.7% |
| PA | 67 | 17.3% |
| VA | 133 | 34.3% |
2.3 Structure and Summary of My Tibble
(Task 5) This section prints out the structure of my tibble. I'm checking to see that:
- The first rows tell me that this is a tibble, with a specification of its dimensions.
- I have a complete set of 388 rows.
- I have included only 11 variables, which include:
  - The three required variables (fipscode, county, and state). fipscode and county should be characters (<chr>), while state should be a factor with the appropriate number of levels (6), followed by its numeric codes.
  - My five originally chosen variables, which should be numeric (<num>).
  - My other two categorical variables, unemployment_level and mental_distress_level, which should be factors, followed by their numeric codes.
str(chr_2022)
tibble [388 × 11] (S3: tbl_df/tbl/data.frame)
 $ fipscode                       : chr [1:388] "24001" "24003" "24005" "24009" ...
 $ state                          : Factor w/ 6 levels "MA","MD","NY",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ county                         : chr [1:388] "Allegany County" "Anne Arundel County" "Baltimore County" "Calvert County" ...
 $ county_ranked                  : num [1:388] 1 1 1 1 1 1 1 1 1 1 ...
 $ freq_mental_distress           : num [1:388] 16.5 12.2 13.8 13.1 16.3 13.1 14.9 13.1 15.8 12.8 ...
 $ unemployment                   : num [1:388] 7.78 5.83 6.83 5.23 5.55 ...
 $ food_insecurity                : num [1:388] 16.9 9 10.7 8.2 13.3 9.6 12.4 6.9 14.7 9.7 ...
 $ suicides                       : num [1:388] 16.2 12.3 10.6 10.4 14.3 ...
 $ alcohol_impaired_driving_deaths: num [1:388] 48.5 29.2 26.8 29.5 23.7 ...
 $ mental_distress_level          : Factor w/ 2 levels "high","low": 1 2 1 2 1 2 1 2 1 2 ...
 $ unemployment_level             : Factor w/ 3 levels "high","low","normal": 1 3 3 3 3 3 3 3 3 3 ...
The following code snippet saves the tibble to an .Rds file.
saveRDS(chr_2022, "chr_2022_Quynh_Nguyen.rds")
We can try to load it with this command
chr_2022_load <- readRDS("chr_2022_Quynh_Nguyen.rds")
chr_2022_load
# A tibble: 388 × 11
fipscode state county count…¹ freq_…² unemp…³ food_…⁴ suici…⁵ alcoh…⁶ menta…⁷
<chr> <fct> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct>
1 24001 MD Alleg… 1 16.5 7.78 16.9 16.2 48.5 high
2 24003 MD Anne … 1 12.2 5.83 9 12.3 29.2 low
3 24005 MD Balti… 1 13.8 6.83 10.7 10.6 26.8 high
4 24009 MD Calve… 1 13.1 5.23 8.2 10.4 29.5 low
5 24011 MD Carol… 1 16.3 5.55 13.3 14.3 23.7 high
6 24013 MD Carro… 1 13.1 5.09 9.6 12.5 24.1 low
7 24015 MD Cecil… 1 14.9 5.94 12.4 17.1 25.4 high
8 24017 MD Charl… 1 13.1 6.72 6.9 10.5 35.4 low
9 24019 MD Dorch… 1 15.8 6.67 14.7 13.7 30.4 high
10 24021 MD Frede… 1 12.8 5.94 9.7 12.3 27.4 low
# … with 378 more rows, 1 more variable: unemployment_level <fct>, and
# abbreviated variable names ¹county_ranked, ²freq_mental_distress,
# ³unemployment, ⁴food_insecurity, ⁵suicides,
# ⁶alcohol_impaired_driving_deaths, ⁷mental_distress_level
Or we can save it as a .csv file. Note that write.csv() always writes a header row for data frames, so a col.names argument is not needed (and would be ignored with a warning):
write.csv(chr_2022, file = "chr_2022_Quynh_Nguyen.csv")
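Alternatively, readr::write_csv() (loaded with the tidyverse) also writes a header by default and does not add row names; a minimal sketch:
# Sketch: write the tibble to a .csv file with readr
write_csv(chr_2022, "chr_2022_Quynh_Nguyen.csv")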
3 Codebook
(Task 6) This section displays the codebook for each variable in my tibble and its definition. It contains two parts: (1) for states and (2) for variables.
3.1 Codebook for states
| State Name | Abbreviation | Number of counties |
|---|---|---|
| Massachusetts | MA | 14 |
| Maryland | MD | 24 |
| New York | NY | 62 |
| Ohio | OH | 88 |
| Pennsylvania | PA | 67 |
| Virginia | VA | 133 |
| Total | - | 388 |
3.2 Codebook for variables
| Variable | Original code | Description | Number of missing |
|---|---|---|---|
| fipscode | N/A | FIPS code | 0 |
| state | N/A | Includes the 6 chosen states: MA, MD, NY, OH, PA, VA | 0 |
| county | N/A | County name; includes 388 counties from the 6 chosen states | 0 |
| county_ranked | N/A | Indicator that the county is ranked; all values should equal 1 | 0 |
| freq_mental_distress | v145 | Frequent mental distress rate (%) | 0 |
| unemployment | v023 | Unemployment rate (%) | 0 |
| food_insecurity | v139 | Food insecurity rate (%) | 0 |
| suicides | v161 | Suicide deaths per 100,000 population (raw value) | 24 |
| alcohol_impaired_driving_deaths | v134 | Alcohol-impaired driving deaths (%) | 6 |
| unemployment_level | N/A | 3 levels: low = unemployment < 5%; normal = 5% <= unemployment < 7%; high = unemployment >= 7% | 0 |
| mental_distress_level | N/A | 2 levels: low = freq_mental_distress < 13.2%; high = freq_mental_distress >= 13.2% | 0 |
More details about the five chosen variables are specified below:
- freq_mental_distress was originally variable v145_rawvalue, and it is listed in the Quality of Life subcategory of Health Outcomes at County Health Rankings. It describes the percentage of adults reporting 14 or more days of poor mental health per month. These data were obtained from the Behavioral Risk Factor Surveillance System, 2019. This might be my outcome variable.
- unemployment was originally variable v023_rawvalue, and it is listed under Unemployment in the Social & Economic Factors subcategory of Health Factors at County Health Rankings. It describes the percentage of the population ages 16 and older unemployed but seeking work. These data were obtained from the Bureau of Labor Statistics, 2020.
- food_insecurity was originally variable v139_rawvalue, and it is listed under Diet and Exercise in the Health Behaviors subcategory of Health Factors at County Health Rankings. It describes the percentage of the population who lack adequate access to food. These data were obtained from Map the Meal Gap, 2019.
- suicides was originally variable v161_rawvalue, and it is listed under Community Safety in the Social & Economic Factors subcategory of Health Factors at County Health Rankings. It describes the number of deaths due to suicide per 100,000 population (age-adjusted). These data were obtained from the National Center for Health Statistics - Mortality Files, 2016 - 2020.
- alcohol_impaired_driving_deaths was originally variable v134_rawvalue, and it is listed under Alcohol and Drug Use in the Health Behaviors subcategory of Health Factors at County Health Rankings. It describes the percentage of driving deaths with alcohol involvement. These data were obtained from the Fatality Analysis Reporting System, 2016 - 2020.
Print and Summarize Tibble
(Task 7)
3.3 Print the Tibble
chr_2022
# A tibble: 388 × 11
fipscode state county count…¹ freq_…² unemp…³ food_…⁴ suici…⁵ alcoh…⁶ menta…⁷
<chr> <fct> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct>
1 24001 MD Alleg… 1 16.5 7.78 16.9 16.2 48.5 high
2 24003 MD Anne … 1 12.2 5.83 9 12.3 29.2 low
3 24005 MD Balti… 1 13.8 6.83 10.7 10.6 26.8 high
4 24009 MD Calve… 1 13.1 5.23 8.2 10.4 29.5 low
5 24011 MD Carol… 1 16.3 5.55 13.3 14.3 23.7 high
6 24013 MD Carro… 1 13.1 5.09 9.6 12.5 24.1 low
7 24015 MD Cecil… 1 14.9 5.94 12.4 17.1 25.4 high
8 24017 MD Charl… 1 13.1 6.72 6.9 10.5 35.4 low
9 24019 MD Dorch… 1 15.8 6.67 14.7 13.7 30.4 high
10 24021 MD Frede… 1 12.8 5.94 9.7 12.3 27.4 low
# … with 378 more rows, 1 more variable: unemployment_level <fct>, and
# abbreviated variable names ¹county_ranked, ²freq_mental_distress,
# ³unemployment, ⁴food_insecurity, ⁵suicides,
# ⁶alcohol_impaired_driving_deaths, ⁷mental_distress_level
3.4 Report of Numerical Summaries
Then, I would like to report the numerical summaries.
Hmisc::describe(chr_2022)
chr_2022
11 Variables 388 Observations
--------------------------------------------------------------------------------
fipscode
n missing distinct
388 0 388
lowest : 24001 24003 24005 24009 24011, highest: 51800 51810 51820 51830 51840
--------------------------------------------------------------------------------
state
n missing distinct
388 0 6
lowest : MA MD NY OH PA, highest: MD NY OH PA VA
Value MA MD NY OH PA VA
Frequency 14 24 62 88 67 133
Proportion 0.036 0.062 0.160 0.227 0.173 0.343
--------------------------------------------------------------------------------
county
n missing distinct
388 0 316
lowest : Accomack County Adams County Albany County Albemarle County Alexandria city
highest: Wyandot County Wyoming County Wythe County Yates County York County
--------------------------------------------------------------------------------
county_ranked
n missing distinct Info Mean Gmd
388 0 1 0 1 0
Value 1
Frequency 388
Proportion 1
--------------------------------------------------------------------------------
freq_mental_distress
n missing distinct Info Mean Gmd .05 .10
388 0 90 1 15.79 2.306 12.30 13.10
.25 .50 .75 .90 .95
14.50 15.95 17.10 18.33 19.10
lowest : 9.7 9.8 10.1 10.2 10.3, highest: 20.1 20.2 20.4 20.5 20.8
--------------------------------------------------------------------------------
unemployment
n missing distinct Info Mean Gmd .05 .10
388 0 388 1 7.627 2.032 4.865 5.276
.25 .50 .75 .90 .95
6.379 7.628 8.778 9.778 10.456
lowest : 3.712672 3.790439 3.832466 3.880440 4.176668
highest: 12.532384 12.533597 12.573674 13.946058 16.030909
--------------------------------------------------------------------------------
food_insecurity
n missing distinct Info Mean Gmd .05 .10
388 0 125 1 11.55 3.443 6.335 7.570
.25 .50 .75 .90 .95
9.500 11.600 13.400 15.430 17.065
lowest : 4.0 4.8 4.9 5.0 5.5, highest: 19.1 19.5 19.7 19.8 20.3
--------------------------------------------------------------------------------
suicides
n missing distinct Info Mean Gmd .05 .10
364 24 364 1 15.58 5.588 8.280 9.768
.25 .50 .75 .90 .95
12.077 15.014 18.638 21.598 24.377
lowest : 5.272063 5.324782 5.823285 6.109052 6.143075
highest: 31.216651 31.466973 32.802466 38.058005 43.755493
--------------------------------------------------------------------------------
alcohol_impaired_driving_deaths
n missing distinct Info Mean Gmd .05 .10
382 6 229 1 28.29 12.4 10.54 15.62
.25 .50 .75 .90 .95
21.43 27.61 33.33 41.38 47.04
lowest : 0.000000 2.500000 3.225806 4.166667 4.347826
highest: 58.181818 63.636364 76.923077 77.777778 100.000000
--------------------------------------------------------------------------------
mental_distress_level
n missing distinct
388 0 2
Value high low
Frequency 346 42
Proportion 0.892 0.108
--------------------------------------------------------------------------------
unemployment_level
n missing distinct
388 0 3
Value high low normal
Frequency 239 24 125
Proportion 0.616 0.062 0.322
--------------------------------------------------------------------------------
3.4.1 For
freq_mental_distress variable
mosaic::favstats(~freq_mental_distress, data = chr_2022) |>
kable(digits = 3)
| min | Q1 | median | Q3 | max | mean | sd | n | missing | |
|---|---|---|---|---|---|---|---|---|---|
| 9.7 | 14.5 | 15.95 | 17.1 | 20.8 | 15.787 | 2.055 | 388 | 0 |
3.4.2 For
unemployment variable
mosaic::favstats(~unemployment, data = chr_2022) |>
kable(digits = 3)
| min | Q1 | median | Q3 | max | mean | sd | n | missing | |
|---|---|---|---|---|---|---|---|---|---|
| 3.713 | 6.379 | 7.628 | 8.778 | 16.031 | 7.627 | 1.816 | 388 | 0 |
3.4.3 For
food_insecurity variable
mosaic::favstats(~food_insecurity, data = chr_2022) |>
kable(digits = 3)
| min | Q1 | median | Q3 | max | mean | sd | n | missing | |
|---|---|---|---|---|---|---|---|---|---|
| 4 | 9.5 | 11.6 | 13.4 | 20.3 | 11.555 | 3.059 | 388 | 0 |
3.4.4 For
suicides variable
mosaic::favstats(~suicides, data = chr_2022) |>
kable(digits = 3)
| min | Q1 | median | Q3 | max | mean | sd | n | missing | |
|---|---|---|---|---|---|---|---|---|---|
| 5.272 | 12.077 | 15.014 | 18.638 | 43.755 | 15.584 | 5.164 | 364 | 24 |
3.4.5 For
alcohol_impaired_driving_deaths variable
mosaic::favstats(~alcohol_impaired_driving_deaths, data = chr_2022) |>
kable(digits = 3)
| min | Q1 | median | Q3 | max | mean | sd | n | missing | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 21.429 | 27.609 | 33.333 | 100 | 28.292 | 11.753 | 382 | 6 |
3.4.6 For
mental_distress_level variable
chr_2022 |> tabyl(mental_distress_level) |> adorn_pct_formatting() |> kable()
| mental_distress_level | n | percent |
|---|---|---|
| high | 346 | 89.2% |
| low | 42 | 10.8% |
3.4.7 For
unemployment_level variable
chr_2022 |> tabyl(unemployment_level) |> adorn_pct_formatting() |> kable()
| unemployment_level | n | percent |
|---|---|---|
| high | 239 | 61.6% |
| low | 24 | 6.2% |
| normal | 125 | 32.2% |
3.4.8 For
state variable
I already included this in
2.2.3: Checking factors of state variable
4 Report Writeup
I did my project proposal by following the steps as they were listed on the class website. However, I think that caused me some trouble. First of all, I did not see the additional requirements for my data selection until I reached the end of the report. In my original work, instead of choosing Virginia, I chose Georgia. That selection gave me more counties (rows). However, when I checked the percentage of available (non-missing) values for the variable that contains missing values, it turned out this percentage was under 75% for Georgia. Therefore, I had to make another state selection and chose Virginia as an alternative. However, when I replaced Georgia with Virginia, I ended up with two variables containing missing values instead of only one, as in my original work. I did not notice this at first; when I went through my results again, I had to add another small code snippet. Therefore, I wish these checks could be placed near the beginning instead of in the last task. Aside from that, I feel the instructions were well organized and easy to follow.
5 Analyses 1
5.1 The Variables
- The outcome in this analysis is suicides. As indicated in the codebook, suicides is a rate, measured in deaths per 100,000 population. Since this outcome has some missing values, we will use a complete-case analysis.
- Besides the two variables used to create categorical predictors (unemployment and freq_mental_distress), the other two candidate predictors are alcohol_impaired_driving_deaths and food_insecurity. In this analysis, I will use food_insecurity as the predictor variable.
- From the table above, food_insecurity has no missing values, and it is measured as a percentage, so its unit is %.
- The following is the tibble of the variables.
data1 <- chr_2022 %>% select('food_insecurity', 'suicides', 'state', 'county')
miss_var_summary(data1) |> kable()
| variable | n_miss | pct_miss |
|---|---|---|
| suicides | 24 | 6.185567 |
| food_insecurity | 0 | 0.000000 |
| state | 0 | 0.000000 |
| county | 0 | 0.000000 |
# keep complete cases (drop rows with missing suicides)
data1 <- data1 %>% filter(complete.cases(data1))
data1
# A tibble: 364 × 4
food_insecurity suicides state county
<dbl> <dbl> <fct> <chr>
1 16.9 16.2 MD Allegany County
2 9 12.3 MD Anne Arundel County
3 10.7 10.6 MD Baltimore County
4 8.2 10.4 MD Calvert County
5 13.3 14.3 MD Caroline County
6 9.6 12.5 MD Carroll County
7 12.4 17.1 MD Cecil County
8 6.9 10.5 MD Charles County
9 14.7 13.7 MD Dorchester County
10 9.7 12.3 MD Frederick County
# … with 354 more rows
The structure of the table is shown below:
str(data1)
tibble [364 × 4] (S3: tbl_df/tbl/data.frame)
$ food_insecurity: num [1:364] 16.9 9 10.7 8.2 13.3 9.6 12.4 6.9 14.7 9.7 ...
$ suicides : num [1:364] 16.2 12.3 10.6 10.4 14.3 ...
$ state : Factor w/ 6 levels "MA","MD","NY",..: 2 2 2 2 2 2 2 2 2 2 ...
$ county : chr [1:364] "Allegany County" "Anne Arundel County" "Baltimore County" "Calvert County" ...
The number of counties that have complete cases for both variables are 364 counties.
The values of the predictor (food_insecurity) and outcome (suicides) for Cuyahoga County are: food_insecurity is 13.9% and suicides is 13.42 deaths per 100,000.
data1[data1$state =="OH" & data1$county=="Cuyahoga County",] |> kable()
| food_insecurity | suicides | state | county |
|---|---|---|---|
| 13.9 | 13.41873 | OH | Cuyahoga County |
5.2 Research Question
What is the nature of the association between food insecurity and suicides in 364 counties across 6 different states?
5.3 Visualizing the Data
ggplot(data1, aes(x=food_insecurity, y =suicides)) + geom_point() +
labs(x ="Food Insecurity (%)",
y = "Suicides (%)",
title = "How closely associated are food insecurity and suicides?",
subtitle = "Food insecurity as predictors and suicides as outcomes")
Based on the figure above, there seems to be a positive association: as the food_insecurity percentage increases, the suicide rate also tends to increase.
ggplot(data1, aes(x=food_insecurity, y =suicides)) + geom_point(alpha = 0.6) +
geom_smooth(method="lm", col ="red", se = FALSE, formula = y ~ x) +
geom_smooth(method="loess", col ="blue", se=FALSE, formula = y ~x) +
labs(x ="Food Insecurity (%)",
y = "Suicides (%)",
title = "Positive association between food insecurity and suicides",
subtitle = "OLS model in red, loess smooth in blue")
From the plot, food_insecurity and suicides do appear to have a positive association, as the points roughly follow the straight line's path. There are also a couple of outliers in the graph. In addition, the association is not a strong correlation.
5.4 Transformation Assessment
In order to decide whether I should apply a transformation, I would like to investigate the suicides variable, which is my outcome.
p1 <- ggplot(data1, aes(sample = suicides)) +
geom_qq() + # plot the points
geom_qq_line(col = "blue") + # plot the Y = X line
theme(aspect.ratio = 1) + # make the plot square
labs(title = "Normal Q-Q plot: Simulated suicides")
p2 <- ggplot(data1, aes(x=suicides)) +
geom_histogram(aes(y=stat(density)), bins = 20, fill ="royalblue", col ="white") +
stat_function(fun = dnorm, args = list(mean = mean(data1$suicides), sd = sd(data1$suicides)), col ="red", lwd = 1.5) +
labs(title="Density Function: Suicides")
p3 <- ggplot(data1, aes(x = suicides, y = "")) +
geom_boxplot(fill = "royalblue",
outlier.color = "royalblue") +
labs(title = "Boxplot:Suicides", y = "")
p1 + (p2 / p3 + plot_layout(heights = c(4,1)))
mosaic::favstats(~ suicides, data = data1) %>% kable(digits = 1)
| min | Q1 | median | Q3 | max | mean | sd | n | missing | |
|---|---|---|---|---|---|---|---|---|---|
| 5.3 | 12.1 | 15 | 18.6 | 43.8 | 15.6 | 5.2 | 364 | 0 |
From the graphs above, we can see that suicides has a right-skewed distribution.
p1 <- ggplot(data1, aes(sample = food_insecurity)) +
geom_qq() + # plot the points
geom_qq_line(col = "blue") + # plot the Y = X line
theme(aspect.ratio = 1) + # make the plot square
labs(title = "Normal Q-Q plot: Simulated food insecurity")
p2 <- ggplot(data1, aes(x=food_insecurity)) +
geom_histogram(aes(y=stat(density)), bins = 20, fill ="royalblue", col ="white") +
stat_function(fun = dnorm, args = list(mean = mean(data1$food_insecurity), sd = sd(data1$food_insecurity)), col ="red", lwd = 1.5) +
labs(title="Density Function: Food insecurity")
p3 <- ggplot(data1, aes(x = food_insecurity, y = "")) +
geom_boxplot(fill = "royalblue",
outlier.color = "royalblue") +
labs(title = "Boxplot: Food insecurity", y = "")
p1 + (p2 / p3 + plot_layout(heights = c(4,1)))
mosaic::favstats(~ food_insecurity, data = data1) %>% kable(digits = 1)
| min | Q1 | median | Q3 | max | mean | sd | n | missing | |
|---|---|---|---|---|---|---|---|---|---|
| 4 | 9.5 | 11.6 | 13.3 | 20.3 | 11.5 | 3 | 364 | 0 |
Compared to suicides, food_insecurity is approximately normally distributed.
Because of this, I want to investigate transforming the suicides variable by applying the natural logarithm (the log() function) to it.
data2 <- mutate(data1, suicides = log(suicides))
p1 <- ggplot(data2, aes(sample = suicides)) +
geom_qq() + # plot the points
geom_qq_line(col = "blue") + # plot the Y = X line
theme(aspect.ratio = 1) + # make the plot square
labs(title = "Normal Q-Q plot: Simulated suicides")
p2 <- ggplot(data2, aes(x=suicides)) +
geom_histogram(aes(y=stat(density)), bins = 20, fill ="royalblue", col ="white") +
stat_function(fun = dnorm, args = list(mean = mean(data2$suicides), sd = sd(data2$suicides)), col ="red", lwd = 1.5) +
labs(title="Density Function: Suicides")
p3 <- ggplot(data2, aes(x = suicides, y = "")) +
geom_boxplot(fill = "royalblue",
outlier.color = "royalblue") +
labs(title = "Boxplot: Suicides", y = "")
p1 + (p2 / p3 + plot_layout(heights = c(4,1)))
mosaic::favstats(~ suicides, data = data1) %>% kable(digits = 1)
| min | Q1 | median | Q3 | max | mean | sd | n | missing | |
|---|---|---|---|---|---|---|---|---|---|
| 5.3 | 12.1 | 15 | 18.6 | 43.8 | 15.6 | 5.2 | 364 | 0 |
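Note that the summary above is computed on the untransformed data1; a small sketch of the same summary on the log-transformed data2, for comparison:
# Sketch: numerical summary of the log-transformed suicides outcome
mosaic::favstats(~ suicides, data = data2) |> kable(digits = 2)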
After applying the transformation, the suicides variable looks more normally distributed than the right-skewed distribution we had before. Therefore, I decided to apply the logarithmic transformation to my suicides outcome.
Because of this, the association between food_insecurity and suicides will be redone on the log scale.
ggplot(data2, aes(x=food_insecurity, y =suicides)) + geom_point(alpha = 0.6) +
geom_smooth(method="lm", col ="red", se = FALSE, formula = y ~ x) +
geom_smooth(method="loess", col ="blue", se=FALSE, formula = y ~x) +
labs(x ="Food Insecurity (%)",
y = "Log of Suicides Percentage",
title = "Positive association between food insecurity and log of suicides percentage",
subtitle = "OLS model in red, loess smooth in blue")
5.5 The Fitted Model
Now, we would like to create our fitted model:
model1 <- lm(suicides ~ food_insecurity, data = data2)
broom::tidy(model1, conf.int = TRUE, conf.level = 0.90) %>% select(term, estimate, conf.low, conf.high, p.value ) %>%
kable(digits = 3)
| term | estimate | conf.low | conf.high | p.value |
|---|---|---|---|---|
| (Intercept) | 2.225 | 2.120 | 2.330 | 0 |
| food_insecurity | 0.040 | 0.032 | 0.049 | 0 |
Based on the summary of the model, the coefficients are: food_insecurity = 0.04 and intercept = 2.23. Because the coefficient for food_insecurity is larger than 0, there is a positive association between food_insecurity and suicides. The intercept is where the fitted line intersects the y-axis. These coefficients give the equation for my model: log(suicides) = 0.04 * food_insecurity + 2.23. In addition, both coefficients are statistically significant, as the lower bounds of the 90% confidence intervals are above 0 for both food_insecurity and the intercept.
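As a quick worked example (a sketch for a hypothetical county with 12% food insecurity), the fitted equation can be back-transformed from the natural-log scale to the original scale:
# Sketch: predicted suicide rate (per 100,000) for a hypothetical county
# with food_insecurity = 12, back-transformed with exp()
b <- coef(model1)
exp(unname(b["(Intercept)"] + b["food_insecurity"] * 12))   # roughly exp(2.7), about 15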
broom::glance(model1) %>% select(nobs, r.squared, adj.r.squared, sigma) %>% kable(digits = 3)
| nobs | r.squared | adj.r.squared | sigma |
|---|---|---|---|
| 364 | 0.137 | 0.135 | 0.309 |
From the table above, it indicates that the R-squared is 0.137 and residual standard error is 0.309.
Finally, we obtain a Pearson correlation of 0.37.
cor(y = data2$suicides, x = data2$food_insecurity)
[1] 0.3704226
5.6 Residual Analysis
Here we use the augment function to get the fitted and residual values.
model1aug <- broom::augment(model1, data = data2)
Here are the plots assessing normality of the residuals.
p1 <- ggplot(model1aug, aes(sample = .resid)) +
geom_qq(col = "seagreen") + geom_qq_line(col = "black") +
theme(aspect.ratio = 1) +
labs(title = "Normal Q-Q: Model 1 Residuals")
p2 <- ggplot(model1aug, aes(x = .resid)) +
geom_histogram(aes(y = stat(density)),
bins = 20, fill = "seagreen", col = "yellow") +
stat_function(fun = dnorm,
args = list(mean = mean(model1aug$.resid),
sd = sd(model1aug$.resid)),
col = "black", lwd = 1.5) +
labs(title = "Hist + Normal Density: Model 1 Residuals")
p3 <- ggplot(model1aug, aes(x = .resid, y = "")) +
geom_boxplot(fill = "seagreen", outlier.color = "seagreen") +
labs(title = "Boxplot: Model 1 Residuals", y = "")
p1 + (p2 / p3 + plot_layout(heights = c(4,1)))
mosaic::favstats(~ .resid, data = model1aug) %>% kable(digits = 1)
| min | Q1 | median | Q3 | max | mean | sd | n | missing | |
|---|---|---|---|---|---|---|---|---|---|
| -1.2 | -0.2 | 0 | 0.2 | 1.1 | 0 | 0.3 | 364 | 0 |
From the histogram, the residuals look roughly normally distributed. However, both ends of the distribution extend further than the Normal curve would suggest, so the distribution is somewhat heavy-tailed, which the Q-Q plot confirms. Even so, the Normal model may still be a reasonable approximation for these data.
And here is the plot of residuals vs. fitted values to assess non-linearity.
ggplot(model1aug, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_smooth(method = "lm", col = "red",
formula = y ~ x, se = FALSE) +
geom_smooth(method = "loess", col = "blue",
formula = y ~ x, se = FALSE) +
labs(title = "Model 1: Residuals vs. Fitted Values",
x = "Fitted suicides values", y = "Residuals")
From the plot, we do not see any substantial curve when plotting the residuals against the fitted values. However, the plot does not show the ideal 'fuzzy football' pattern either: the variance of the residuals does not look entirely constant across the levels of the fitted values.
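To look at this more directly, a scale-location style check (a sketch) plots the square root of the absolute standardized residuals against the fitted values; a roughly flat trend would suggest constant variance:
# Sketch: scale-location style check for non-constant residual variance
ggplot(model1aug, aes(x = .fitted, y = sqrt(abs(.std.resid)))) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE, col = "blue") +
  labs(title = "Model 1: Scale-Location Check",
       x = "Fitted values", y = "Square root of |standardized residual|")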
Since I used a transformed outcome, I will fit another model here on the original scale to get a prediction for the original outcome and compare it to the actual value for Cuyahoga County in Ohio. The code is displayed below; the predicted value is 17.05 while the actual value is 13.42.
org <- lm(suicides ~ food_insecurity, data = data1)
orgaug <- broom::augment(org, data = data1)
tibble(actual_value = data1[data1$state=="OH" & data1$county =="Cuyahoga County",]$suicides, predicted_value = orgaug[orgaug$state=="OH" & orgaug$county =="Cuyahoga County",]$.fitted) |> kable(digits = 3)
| actual_value | predicted_value |
|---|---|
| 13.419 | 17.054 |
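An alternative sketch (not the approach used above) is to predict Cuyahoga County directly from the log-scale model1 and back-transform with exp(); because of the transformation, this will generally give a somewhat different prediction than refitting on the raw scale.
# Sketch: back-transformed prediction for Cuyahoga County from the log-scale model
cuy <- data2 |> filter(state == "OH", county == "Cuyahoga County")
exp(predict(model1, newdata = cuy))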
The two counties where the model is least successful at predicting the outcome are Bronx County, New York, with an absolute residual of 1.22, and Kings County, New York, with an absolute residual of 1.11.
model1augorder <- mutate(model1aug, .resid = abs(.resid))
model1augorder <- model1augorder[order(-model1augorder$.resid),]
model1augorder[1:2,] |> kable(digits = 3)
| food_insecurity | suicides | state | county | .fitted | .resid | .hat | .sigma | .cooksd | .std.resid |
|---|---|---|---|---|---|---|---|---|---|
| 16.4 | 1.672 | NY | Bronx County | 2.889 | 1.217 | 0.010 | 0.303 | 0.077 | -3.954 |
| 13.5 | 1.662 | NY | Kings County | 2.772 | 1.110 | 0.004 | 0.304 | 0.025 | -3.594 |
5.7 Conclusions and Limitations
For this analysis, we are interested in the nature of the association between food insecurity and suicides in 364 counties. In the analysis above, instead of using the suicide rate directly, we use its logarithm so that the outcome is closer to normally distributed, because the original values are right-skewed. Based on the analysis, there appears to be a positive association between food insecurity and suicides. The association is moderate, as the Pearson correlation is only 0.37. However, the linear regression appears to be statistically significant, as the lower bounds of the confidence intervals for both the food_insecurity coefficient and the intercept are above 0. With that, the association between food insecurity and suicides can be expressed as: log(suicides) = 0.04 * food_insecurity + 2.23.
From the residual plot, we do not quite see the 'fuzzy football' pattern, which is consistent with the residuals not being fully normal (they are heavy-tailed). Therefore, the variance of the residuals across levels of the fitted values is not entirely constant. Another limitation of this study is that the states I chose cannot serve as a representative sample of the U.S., since I selected them based on personal preference rather than at random, and they are all located in the eastern part of the U.S. To obtain a representative sample of the U.S., I would need to choose states randomly, or pick states from different regions.
6 Analysis 2
6.1 The variables
I chose unemployment and freq_mental_distress as the two variables from which I created categorical predictors. In this analysis, I will use freq_mental_distress, together with its categorical version mental_distress_level, as my predictor. freq_mental_distress is measured as a percentage (%). The outcome is the same as before: suicides (deaths per 100,000).
data3 <- chr_2022 %>% select('freq_mental_distress', 'mental_distress_level' ,'suicides', 'state', 'county')
miss_var_summary(data3) |> kable()
| variable | n_miss | pct_miss |
|---|---|---|
| suicides | 24 | 6.185567 |
| freq_mental_distress | 0 | 0.000000 |
| mental_distress_level | 0 | 0.000000 |
| state | 0 | 0.000000 |
| county | 0 | 0.000000 |
# keep complete cases (drop rows with missing suicides)
data3 <- data3 %>% filter(complete.cases(data3))
data3
# A tibble: 364 × 5
freq_mental_distress mental_distress_level suicides state county
<dbl> <fct> <dbl> <fct> <chr>
1 16.5 high 16.2 MD Allegany County
2 12.2 low 12.3 MD Anne Arundel County
3 13.8 high 10.6 MD Baltimore County
4 13.1 low 10.4 MD Calvert County
5 16.3 high 14.3 MD Caroline County
6 13.1 low 12.5 MD Carroll County
7 14.9 high 17.1 MD Cecil County
8 13.1 low 10.5 MD Charles County
9 15.8 high 13.7 MD Dorchester County
10 12.8 low 12.3 MD Frederick County
# … with 354 more rows
The structure of the table is shown below:
str(data3)
tibble [364 × 5] (S3: tbl_df/tbl/data.frame)
$ freq_mental_distress : num [1:364] 16.5 12.2 13.8 13.1 16.3 13.1 14.9 13.1 15.8 12.8 ...
$ mental_distress_level: Factor w/ 2 levels "high","low": 1 2 1 2 1 2 1 2 1 2 ...
$ suicides : num [1:364] 16.2 12.3 10.6 10.4 14.3 ...
$ state : Factor w/ 6 levels "MA","MD","NY",..: 2 2 2 2 2 2 2 2 2 2 ...
$ county : chr [1:364] "Allegany County" "Anne Arundel County" "Baltimore County" "Calvert County" ...
The number of counties that have complete cases for both variables are 364 counties.
The values of the predictor (freq_mental_distress) and outcome (suicides) for Cuyahoga County are: freq_mental_distress is 16.7% and suicides is 13.419 deaths per 100,000.
data3[data3$state =="OH" & data3$county=="Cuyahoga County",] |> kable(digits = 3)
| freq_mental_distress | mental_distress_level | suicides | state | county |
|---|---|---|---|---|
| 16.7 | high | 13.419 | OH | Cuyahoga County |
6.2 Research Question
Does a low or high frequency of mental distress correspond to higher suicide rates in the 364 counties across the six selected states?
6.3 Visualizing Data
From the first analysis, we already know that the original distribution of suicides, our outcome, is right-skewed. To better satisfy the normality assumption, I again apply the natural log transformation to the outcome.
data4 <- mutate(data3, suicides = log(suicides))
str(data4)
tibble [364 × 5] (S3: tbl_df/tbl/data.frame)
$ freq_mental_distress : num [1:364] 16.5 12.2 13.8 13.1 16.3 13.1 14.9 13.1 15.8 12.8 ...
$ mental_distress_level: Factor w/ 2 levels "high","low": 1 2 1 2 1 2 1 2 1 2 ...
$ suicides : num [1:364] 2.79 2.51 2.36 2.34 2.66 ...
$ state : Factor w/ 6 levels "MA","MD","NY",..: 2 2 2 2 2 2 2 2 2 2 ...
$ county : chr [1:364] "Allegany County" "Anne Arundel County" "Baltimore County" "Calvert County" ...
ggplot(data = data4, aes(x = mental_distress_level , y = suicides, color = mental_distress_level)) +
geom_boxplot() +
labs(title = "Suicides by Levels of Mental Distress",
subtitle = "Log of Percentage of Suicides",
x = "Levels of Mental Distress",
y = "Log of Percentage of Suicides") +
labs(color="Mental Distress Level")
From the box plot above, we can see a fairly large upward shift in the log suicide rate for the high level of mental distress compared to the low level.
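To quantify the shift visible in the box plot, a small sketch of grouped numerical summaries on the log scale:
# Sketch: summary of log(suicides) by mental distress level
mosaic::favstats(suicides ~ mental_distress_level, data = data4) |> kable(digits = 2)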
6.4 The Fitted Model
First, consider a model that uses only the quantitative predictor, without the mental_distress_level factor:
model2 <- lm(suicides ~ freq_mental_distress, data = data4)
broom::tidy(model2, conf.int = TRUE, conf.level = 0.90) %>% select(term, estimate, conf.low, conf.high, p.value ) %>%
kable(digits = 3)
| term | estimate | conf.low | conf.high | p.value |
|---|---|---|---|---|
| (Intercept) | 1.415 | 1.224 | 1.606 | 0 |
| freq_mental_distress | 0.081 | 0.069 | 0.093 | 0 |
Based on the summary of the model, the coefficients are: freq_mental_distress = 0.081 and intercept = 1.42. Because the coefficient for freq_mental_distress is larger than 0, there is a positive association between freq_mental_distress and suicides. The intercept is where the fitted line intersects the y-axis. These coefficients give the equation for my model: log(suicides) = 0.081 * freq_mental_distress + 1.42. In addition, both coefficients are statistically significant, as the lower bounds of the confidence intervals are above 0 for both freq_mental_distress and the intercept.
We also want to see how the association varies by mental_distress_level.
data4 %>% group_by(mental_distress_level) %>% summarize(n = n(), pearson_r = cor(suicides, freq_mental_distress), r.squared = pearson_r^2) %>%
kable(digits = 3)
| mental_distress_level | n | pearson_r | r.squared |
|---|---|---|---|
| high | 324 | 0.369 | 0.137 |
| low | 40 | 0.125 | 0.016 |
From the table, the high level of mental distress has a substantially higher Pearson correlation than the low level, so the model fits the high-level counties better than the low-level ones.
For the baseline reference level, I will use "high" (the default first level of the factor), and refit the model, now including the categorical predictor:
model2 <- lm(suicides ~ freq_mental_distress + mental_distress_level, data = data4)
broom::tidy(model2, conf.int = TRUE, conf.level = 0.90) %>% select(term, estimate, conf.low, conf.high, p.value ) %>%
kable(digits = 3)
| term | estimate | conf.low | conf.high | p.value |
|---|---|---|---|---|
| (Intercept) | 1.700 | 1.453 | 1.946 | 0.000 |
| freq_mental_distress | 0.064 | 0.049 | 0.079 | 0.000 |
| mental_distress_levellow | -0.180 | -0.280 | -0.080 | 0.003 |
Based on the summary of the model, the coefficients are: freq_mental_distress = 0.064, intercept = 1.70, and the indicator for the low level = -0.18. The fitted model can be written as: (1) for the low level of mental distress, log(suicides) = 0.064 * freq_mental_distress + 1.52, and (2) for the high level of mental distress, log(suicides) = 0.064 * freq_mental_distress + 1.70. Because the coefficient for freq_mental_distress is larger than 0, there is a positive association between freq_mental_distress and suicides. The intercept is where the fitted line for the high level intersects the y-axis. These coefficients are statistically significant, as the lower bounds of the confidence intervals are above 0 for both freq_mental_distress and the intercept. The coefficient for the low level is also statistically significant, since its 90% confidence interval (-0.28 to -0.08) lies entirely below 0, although its effect is modest compared to the contribution of the continuous predictor.
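Because the outcome is modeled on the natural-log scale, the coefficient for the low level has a multiplicative interpretation on the original suicide-rate scale; a small sketch:
# Sketch: multiplicative effect of the "low" level at a fixed freq_mental_distress
exp(coef(model2)["mental_distress_levellow"])   # exp(-0.18), about 0.84, i.e. roughly 16% lower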
broom::glance(model2) %>% select(nobs, r.squared, adj.r.squared, sigma) %>% kable(digits = 3)
| nobs | r.squared | adj.r.squared | sigma |
|---|---|---|---|
| 364 | 0.273 | 0.268 | 0.284 |
The table above indicates that the R-squared is 0.273 and the residual standard error is 0.284, with 364 observations.
6.5 Prediction Analysis
Here we use the augment function to get the fitted and residual values.
model2aug <- broom::augment(model2, data = data4)
Here are the plots assessing normality of the residuals.
p1 <- ggplot(model2aug, aes(sample = .resid)) +
geom_qq(col = "seagreen") + geom_qq_line(col = "black") +
theme(aspect.ratio = 1) +
labs(title = "Normal Q-Q: Model 2 Residuals")
p2 <- ggplot(model2aug, aes(x = .resid)) +
geom_histogram(aes(y = stat(density)),
bins = 20, fill = "seagreen", col = "yellow") +
stat_function(fun = dnorm,
args = list(mean = mean(model2aug$.resid),
sd = sd(model2aug$.resid)),
col = "black", lwd = 1.5) +
labs(title = "Hist + Normal Density: Model 2 Residuals")
p3 <- ggplot(model2aug, aes(x = .resid, y = "")) +
geom_boxplot(fill = "seagreen", outlier.color = "seagreen") +
labs(title = "Boxplot: Model 2 Residuals", y = "")
p1 + (p2 / p3 + plot_layout(heights = c(4,1)))
mosaic::favstats(~ .resid, data = model2aug) %>% kable(digits = 1)
| min | Q1 | median | Q3 | max | mean | sd | n | missing | |
|---|---|---|---|---|---|---|---|---|---|
| -1.2 | -0.2 | 0 | 0.2 | 1 | 0 | 0.3 | 364 | 0 |
From the histogram, the residuals look roughly normally distributed. However, both ends of the distribution extend further than the Normal curve would suggest, so the distribution is somewhat heavy-tailed, which the Q-Q plot confirms. Even so, the Normal model may still be a reasonable approximation for these data.
ggplot(model2aug, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_smooth(method = "lm", col = "red",
formula = y ~ x, se = FALSE) +
geom_smooth(method = "loess", col = "blue",
formula = y ~ x, se = FALSE) +
labs(title = "Model 2: Residuals vs. Fitted Values",
x = "Fitted suicides values", y = "Residuals")
From the plot, we can see some curvature when plotting the residuals against the fitted values. In addition, the plot does not show the 'fuzzy football' pattern: the variance of the residuals does not look constant across the levels of the fitted values.
Since I used a transformed outcome, I will fit another model here on the original scale to get a prediction for the original outcome and compare it to the actual value for Cuyahoga County in Ohio. The code is displayed below; the predicted value is 16.72 while the actual value is 13.42.
org2 <- lm(suicides ~ freq_mental_distress + mental_distress_level, data = data3)
org2aug <- broom::augment(org2, data = data3)
tibble(actual_value = data3[data3$state=="OH" & data3$county =="Cuyahoga County",]$suicides, predicted_value = org2aug[org2aug$state=="OH" & org2aug$county =="Cuyahoga County",]$.fitted) |> kable(digits = 3)
| actual_value | predicted_value |
|---|---|
| 13.419 | 16.723 |
The two counties where the model is least successful at predicting the outcome are Holmes County, Ohio, with an absolute residual of about 1.18, and Bronx County, New York, with an absolute residual of about 1.01.
model2augorder <- mutate(model2aug, .resid = abs(.resid))
model2augorder <- model2augorder[order(-model2augorder$.resid),]
model2augorder[1:2,] |> kable(digits = 3)
| freq_mental_distress | mental_distress_level | suicides | state | county | .fitted | .resid | .hat | .sigma | .cooksd | .std.resid |
|---|---|---|---|---|---|---|---|---|---|---|
| 20.2 | high | 1.815 | OH | Holmes County | 2.999 | 1.183 | 0.020 | 0.278 | 0.119 | -4.203 |
| 15.3 | high | 1.672 | NY | Bronx County | 2.684 | 1.011 | 0.004 | 0.280 | 0.017 | -3.563 |
6.6 Conclusion and Limitations
In this section, we wanted to know whether the level of frequent mental distress is associated with higher suicide rates in the 364 counties across the six states we investigated. From the box plot and the R-squared values, the high level of mental distress shows a stronger positive association with suicides. We also used 'high' as the baseline level for the model. After the analysis, we can say that there is a significant positive association between the frequency of mental distress and suicides, and that counties in the low mental distress level have somewhat lower suicide rates, although the level indicator adds relatively little beyond the continuous predictor. The limitations are the same as above: my states cannot be representative of the whole U.S., as they were not picked randomly. Since the variance of the residuals is not constant and the residual distribution is quite heavy-tailed, I do not think we have a very good model after all.
7 Analysis 3
7.1 The Variables
- The outcome in this analysis is suicides. As indicated in the codebook, suicides is a rate, measured in deaths per 100,000 population. Since this outcome has some missing values, we will use a complete-case analysis.
- Besides the two variables used to create categorical predictors (unemployment and freq_mental_distress), the other two candidate predictors are alcohol_impaired_driving_deaths and food_insecurity. In this analysis, I will again use food_insecurity as the predictor variable.
- From the table above, food_insecurity has no missing values, and it is measured as a percentage, so its unit is %.
- The following is the tibble of the variables:
data1 <- chr_2022 %>% select('food_insecurity', 'suicides', 'state', 'county')
miss_var_summary(data1) |> kable()
| variable | n_miss | pct_miss |
|---|---|---|
| suicides | 24 | 6.185567 |
| food_insecurity | 0 | 0.000000 |
| state | 0 | 0.000000 |
| county | 0 | 0.000000 |
data1 <- data1 %>% filter(complete.cases(data1))
data1
# A tibble: 364 × 4
food_insecurity suicides state county
<dbl> <dbl> <fct> <chr>
1 16.9 16.2 MD Allegany County
2 9 12.3 MD Anne Arundel County
3 10.7 10.6 MD Baltimore County
4 8.2 10.4 MD Calvert County
5 13.3 14.3 MD Caroline County
6 9.6 12.5 MD Carroll County
7 12.4 17.1 MD Cecil County
8 6.9 10.5 MD Charles County
9 14.7 13.7 MD Dorchester County
10 9.7 12.3 MD Frederick County
# … with 354 more rows
The structure of the table is shown below:
str(data1)
tibble [364 × 4] (S3: tbl_df/tbl/data.frame)
$ food_insecurity: num [1:364] 16.9 9 10.7 8.2 13.3 9.6 12.4 6.9 14.7 9.7 ...
$ suicides : num [1:364] 16.2 12.3 10.6 10.4 14.3 ...
$ state : Factor w/ 6 levels "MA","MD","NY",..: 2 2 2 2 2 2 2 2 2 2 ...
$ county : chr [1:364] "Allegany County" "Anne Arundel County" "Baltimore County" "Calvert County" ...
The number of counties that have complete cases for both variables are 364 counties.
The values of the predictor (food_insecurity) and outcome (suicides) for Cuyahoga County, OH are: food_insecurity is 13.9% and suicides is 13.42 deaths per 100,000.
7.2 Research Question
How well does food insecurity predict suicides after accounting for differences between states?
7.3 Visualizing the Data
From the first analysis, we already know that the original distribution of suicides, our outcome, is right-skewed. To better satisfy the normality assumption, I again apply the natural log transformation to the outcome.
data2 <- mutate(data1, suicides = log(suicides))
The following graph shows the association between food_insecurity and suicides for all states together.
ggplot(data2, aes(x=food_insecurity, y =suicides, color=state)) + geom_point() +
geom_smooth(method="lm", col ="red", se = FALSE, formula = y ~ x) +
geom_smooth(method="loess", col ="blue", se=FALSE, formula = y ~x) +
labs(x ="Food Insecurity (%)",
y = "Log of Suicides Percentage",
title = "Positive association between food insecurity and log of suicides percentage",
subtitle = "OLS model in red, loess smooth in blue") +
labs(color="State")
Since this plot is essentially the same as the one in Analysis 1, we already know there is a positive association between food insecurity and suicides. However, the states appear to behave somewhat differently. For example, most of the pink dots, representing Virginia, cluster toward the left side. The red dots, representing Massachusetts, also tend to sit on the left, below the line. The green and blue dots, representing New York and Ohio respectively, tend to accumulate in the middle, while the yellow dots, for Maryland, are spread around.
This graph will represent the association between
food_insecurity and suicides by each
state.
ggplot(data2, aes(x=food_insecurity, y =suicides, color = state)) +
geom_point() +
geom_smooth(method="lm", col ="red", se = FALSE, formula = y ~ x) +
geom_smooth(method="loess", col ="blue", se=FALSE, formula = y ~x) +
facet_wrap(state ~ ., ncol = 3) +
labs(x ="Food Insecurity (%)",
y = "Log of Suicides Percentage",
title = "Association between food insecurity and log of suicides percentage",
subtitle = "OLS model in red, loess smooth in blue", color = "State")
When we view the states separately, all of them except Massachusetts show a positive association between our variables; Massachusetts shows a weak negative association.
7.4 The Fitted Model
data2 %>% group_by(state) %>% summarize(n = n(), pearson_r = cor(suicides, food_insecurity), r.squared = pearson_r^2) %>% kable(digits = 3)
| state | n | pearson_r | r.squared |
|---|---|---|---|
| MA | 13 | -0.022 | 0.000 |
| MD | 24 | 0.510 | 0.260 |
| NY | 61 | 0.378 | 0.143 |
| OH | 87 | 0.499 | 0.249 |
| PA | 64 | 0.428 | 0.183 |
| VA | 115 | 0.444 | 0.197 |
In order to use OH as the baseline state, we have to re-level the factor, since by default another level of state (MA, the first level) would be the reference.
data2 <- mutate(data2, state = relevel(state, "OH"))
model <- lm(suicides ~ food_insecurity + state, data = data2)
broom::tidy(model, conf.int = TRUE, conf.level = 0.90) %>% select(term, estimate, conf.low, conf.high, p.value ) %>% kable(digits = 3)
| term | estimate | conf.low | conf.high | p.value |
|---|---|---|---|---|
| (Intercept) | 2.123 | 2.003 | 2.244 | 0.000 |
| food_insecurity | 0.045 | 0.037 | 0.053 | 0.000 |
| stateMA | -0.163 | -0.299 | -0.027 | 0.048 |
| stateMD | -0.178 | -0.279 | -0.077 | 0.004 |
| stateNY | -0.200 | -0.275 | -0.125 | 0.000 |
| statePA | 0.162 | 0.088 | 0.236 | 0.000 |
| stateVA | 0.223 | 0.157 | 0.289 | 0.000 |
Based on the summary of the model, the coefficient for food_insecurity is 0.045 and the intercept is 2.123. The fitted model can be written as: (1) for OH, log(suicides) = 0.045 * food_insecurity + 2.123; (2) for MA, log(suicides) = 0.045 * food_insecurity + 1.960; (3) for MD, log(suicides) = 0.045 * food_insecurity + 1.945; (4) for NY, log(suicides) = 0.045 * food_insecurity + 1.923; (5) for PA, log(suicides) = 0.045 * food_insecurity + 2.285; and (6) for VA, log(suicides) = 0.045 * food_insecurity + 2.346. Because the coefficient for food_insecurity is larger than 0, there is a positive association between food_insecurity and suicides. The intercept is where the fitted line for OH intersects the y-axis, and the coefficient for each state is the difference in intercept compared to the baseline reference, i.e., OH. In addition, these coefficients are statistically significant: the lower bound of the confidence interval is above 0 when the estimate is positive, and the upper bound is below 0 when the estimate is negative. In this case, every coefficient is significant.
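The state-specific intercepts written out above can also be computed directly from the fitted coefficients; a small sketch:
# Sketch: implied intercept for each state (baseline OH plus the state offset)
b <- coef(model)
tibble(state = c("OH", "MA", "MD", "NY", "PA", "VA"),
       intercept = b["(Intercept)"] + c(0, b[paste0("state", c("MA", "MD", "NY", "PA", "VA"))])) |>
  kable(digits = 3)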
broom::glance(model) %>% select(nobs, r.squared, adj.r.squared, sigma) %>% kable(digits = 3)
| nobs | r.squared | adj.r.squared | sigma |
|---|---|---|---|
| 364 | 0.387 | 0.377 | 0.262 |
The table above indicates that the R-squared is 0.387 and the residual standard error is 0.262, with 364 observations.
Residual Analysis
Here we use the augment function to get the fitted and residual values.
modelaug <- broom::augment(model, data = data2)
Here are the plots assessing normality of the residuals.
p1 <- ggplot(modelaug, aes(sample = .resid)) +
geom_qq(col = "seagreen") + geom_qq_line(col = "black") +
theme(aspect.ratio = 1) +
labs(title = "Normal Q-Q: Model 1 Residuals")
p2 <- ggplot(modelaug, aes(x = .resid)) +
geom_histogram(aes(y = stat(density)),
bins = 20, fill = "seagreen", col = "yellow") +
stat_function(fun = dnorm,
args = list(mean = mean(modelaug$.resid),
sd = sd(modelaug$.resid)),
col = "black", lwd = 1.5) +
labs(title = "Hist + Normal Density: Model 1 Residuals")
p3 <- ggplot(modelaug, aes(x = .resid, y = "")) +
geom_boxplot(fill = "seagreen", outlier.color = "seagreen") +
labs(title = "Boxplot: Model 1 Residuals", y = "")
p1 + (p2 / p3 + plot_layout(heights = c(4,1)))
mosaic::favstats(~ .resid, data = modelaug) %>% kable(digits = 1)
| min | Q1 | median | Q3 | max | mean | sd | n | missing | |
|---|---|---|---|---|---|---|---|---|---|
| -1 | -0.1 | 0 | 0.2 | 0.9 | 0 | 0.3 | 364 | 0 |
From the histogram, the residuals look roughly normally distributed. However, both ends of the distribution extend further than the Normal curve would suggest, so the distribution is somewhat heavy-tailed, which the Q-Q plot confirms. Even so, the Normal model may still be a reasonable approximation for these data.
And here is the plot of residuals vs. fitted values to assess non-linearity.
ggplot(modelaug, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_smooth(method = "lm", col = "red",
formula = y ~ x, se = FALSE) +
geom_smooth(method = "loess", col = "blue",
formula = y ~ x, se = FALSE) +
labs(title = "Model: Residuals vs. Fitted Values",
x = "Fitted suicides values", y = "Residuals")
From the plot, we do not see any substantial curve when plotting the residuals against the fitted values. However, the plot does not show the ideal 'fuzzy football' pattern either: the variance of the residuals does not look entirely constant across the levels of the fitted values.
Since I used a transformed outcome, I will fit another model here on the original scale to get a prediction for the original outcome and compare it to the actual value for Cuyahoga County in Ohio. The code is displayed below; the predicted value is 17.05 while the actual value is 13.42.
state1 <- mutate(data1, state = relevel(state, "OH"))
orgf <- lm(suicides ~ food_insecurity + state, data = data1)
orgfaug <- broom::augment(org, data = data1)
tibble(actual_value = data1[data1$state=="OH" & data1$county =="Cuyahoga County",]$suicides, predicted_value = orgfaug[orgfaug$state=="OH" & orgfaug$county =="Cuyahoga County",]$.fitted) |> kable(digits = 3)
| actual_value | predicted_value |
|---|---|
| 13.419 | 17.054 |
The two counties where the model is least successful at predicting the outcome are Bronx County, New York, with an absolute residual of 0.992, and Charlottesville city, Virginia, with an absolute residual of 0.971.
modelaugorder <- mutate(modelaug, .resid = abs(.resid))
modelaugorder <- modelaugorder[order(-modelaugorder$.resid),]
modelaugorder[1:2,] |> kable(digits = 3)
| food_insecurity | suicides | state | county | .fitted | .resid | .hat | .sigma | .cooksd | .std.resid |
|---|---|---|---|---|---|---|---|---|---|
| 16.4 | 1.672 | NY | Bronx County | 2.664 | 0.992 | 0.026 | 0.257 | 0.055 | -3.829 |
| 12.4 | 1.935 | VA | Charlottesville city | 2.907 | 0.971 | 0.010 | 0.258 | 0.019 | -3.719 |
7.5 Conclusion and Limitations
For this analysis, I use the same predictor variable as in Analysis 1, except that I consider the association within each of the six chosen states in order to investigate how well food insecurity predicts suicides after accounting for state. From the scatter plots above, even though the combination of all six states gives a positive association between food insecurity and suicides, when we break it up by state, Massachusetts shows a negative association while the others remain positive. With that in mind, we fit a model using Ohio as the baseline reference. In this model, the coefficient for food insecurity and the state intercepts are all significant, but the food insecurity coefficient is small (0.045). This indicates only a weak positive association between food insecurity and suicides, although it is slightly larger than the coefficient obtained when combining all states without a state adjustment (0.040). This may partly reflect using Ohio as the reference state, since Ohio has one of the highest Pearson correlations among the six states. Moreover, compared to the original model in Analysis 1, breaking the data up by state does not change the picture much, as the food insecurity coefficient and the intercepts are quite similar. In addition, I doubt this model gives a good prediction for every state, since for Massachusetts it still implies a positive association when the within-state pattern appears to be negative. The limitations are the same as above: my selection of states cannot be considered a representative sample of the U.S. because of selection bias. Finally, the residual distribution appears heavy-tailed and somewhat left-skewed, so it is hard to argue that the residuals are normally distributed.
8 Session Information
sessionInfo()
R version 4.2.1 (2022-06-23)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur ... 10.16
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] patchwork_1.1.2.9000 mosaic_1.8.4.2 mosaicData_0.20.3
[4] ggformula_0.10.2 Matrix_1.5-3 lattice_0.20-45
[7] forcats_0.5.2 stringr_1.4.1 dplyr_1.0.10
[10] purrr_0.3.5 readr_2.1.3 tidyr_1.2.1
[13] tibble_3.1.8 ggplot2_3.3.6 tidyverse_1.3.2
[16] naniar_0.6.1 janitor_2.1.0 rmdformats_1.0.4
[19] knitr_1.40
loaded via a namespace (and not attached):
[1] googledrive_2.0.0 colorspace_2.0-3 deldir_1.0-6
[4] ellipsis_0.3.2 ggridges_0.5.4 visdat_0.5.3
[7] ggstance_0.3.5 snakecase_0.11.0 htmlTable_2.4.1
[10] base64enc_0.1-3 fs_1.5.2 rstudioapi_0.14
[13] farver_2.1.1 bit64_4.0.5 fansi_1.0.3
[16] lubridate_1.8.0 xml2_1.3.3 splines_4.2.1
[19] cachem_1.0.6 polyclip_1.10-0 Formula_1.2-4
[22] jsonlite_1.8.2 broom_1.0.1 cluster_2.1.3
[25] dbplyr_2.2.1 png_0.1-7 ggforce_0.4.1
[28] compiler_4.2.1 httr_1.4.4 backports_1.4.1
[31] assertthat_0.2.1 fastmap_1.1.0 gargle_1.2.1
[34] cli_3.4.1 tweenr_2.0.2 htmltools_0.5.3
[37] tools_4.2.1 gtable_0.3.1 glue_1.6.2
[40] Rcpp_1.0.9 cellranger_1.1.0 jquerylib_0.1.4
[43] vctrs_0.5.0 nlme_3.1-157 xfun_0.33
[46] rvest_1.0.3 lifecycle_1.0.3 mosaicCore_0.9.2.1
[49] googlesheets4_1.0.1 MASS_7.3-57 scales_1.2.1
[52] vroom_1.6.0 hms_1.1.2 parallel_4.2.1
[55] RColorBrewer_1.1-3 yaml_2.3.5 curl_4.3.3
[58] gridExtra_2.3 sass_0.4.2 labelled_2.10.0
[61] rpart_4.1.16 latticeExtra_0.6-30 stringi_1.7.8
[64] highr_0.9 checkmate_2.1.0 rlang_1.0.6
[67] pkgconfig_2.0.3 evaluate_0.16 labeling_0.4.2
[70] htmlwidgets_1.5.4 bit_4.0.4 tidyselect_1.1.2
[73] magrittr_2.0.3 bookdown_0.29 R6_2.5.1
[76] generics_0.1.3 Hmisc_4.7-1 DBI_1.1.3
[79] mgcv_1.8-40 pillar_1.8.1 haven_2.5.1
[82] foreign_0.8-82 withr_2.5.0 survival_3.3-1
[85] nnet_7.3-17 modelr_0.1.9 crayon_1.5.2
[88] interp_1.1-3 utf8_1.2.2 tzdb_0.3.0
[91] rmarkdown_2.16 jpeg_0.1-9 grid_4.2.1
[94] readxl_1.4.1 data.table_1.14.2 reprex_2.0.2
[97] digest_0.6.30 munsell_0.5.0 bslib_0.4.0