Victor Torres
2024-05-14
I decided to work with a dataset that contains NYC vaccination rates by age, gender, borough, etc. This data contains 168 observations and 15 columns with rich data for this project.
Relationship of age group and vaccination rate in NYC got the COVID 19 vaccine until 2023? Correlation between age group, etchnicity, fully vaccinated, and people with first dose. The main objective is to create several variables to study the relationship between several aspects of this dataset, find statistical values in columns, and combine columns values to answer my research questions above.
I’m going to work with data New York City DOH Website: https://www.nyc.gov/site/doh/covid/covid-19-data-vaccines.page This dataset provides data about vaccinations rate by age, borough, ethnicity, etc.
This is a research and observational study, since the city of New York collected the data based on data from hospitals and vaccination sites throughout the city.
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## ------------------------------------------------------------------------------
##
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
##
## ------------------------------------------------------------------------------
##
##
## Attaching package: 'plyr'
##
##
## The following objects are masked from 'package:dplyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
##
##
## The following object is masked from 'package:purrr':
##
## compact
##
##
##
## Attaching package: 'plotly'
##
##
## The following objects are masked from 'package:plyr':
##
## arrange, mutate, rename, summarise
##
##
## The following object is masked from 'package:ggplot2':
##
## last_plot
##
##
## The following object is masked from 'package:stats':
##
## filter
##
##
## The following object is masked from 'package:graphics':
##
## layout
##
##
##
## Attaching package: 'RCurl'
##
##
## The following object is masked from 'package:tidyr':
##
## complete
## Warning: package 'tidyselect' was built under R version 4.3.3
##
## Attaching package: 'data.table'
##
## The following objects are masked from 'package:lubridate':
##
## hour, isoweek, mday, minute, month, quarter, second, wday, week,
## yday, year
##
## The following objects are masked from 'package:dplyr':
##
## between, first, last
##
## The following object is masked from 'package:purrr':
##
## transpose
##
##
## Attaching package: 'MASS'
##
## The following object is masked from 'package:plotly':
##
## select
##
## The following object is masked from 'package:dplyr':
##
## select
## Rows: 168 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): BOROUGH, AGE_GROUP, RACE_ETHNICITY, COUNT_ADDITIONAL_CUMULATIVE, C...
## dbl (7): POP_DENOMINATOR, COUNT_PARTIALLY_CUMULATIVE, COUNT_FULLY_CUMULATIV...
## date (1): DATE
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 6 × 15
## DATE BOROUGH AGE_GROUP RACE_ETHNICITY POP_DENOMINATOR
## <date> <chr> <chr> <chr> <dbl>
## 1 2023-09-12 Citywide '0-4 Asian/NHPI 67072
## 2 2023-09-12 Citywide '0-4 Black 106526
## 3 2023-09-12 Citywide '0-4 Hispanic/Latino 175607
## 4 2023-09-12 Citywide '0-4 White 152540
## 5 2023-09-12 Bronx '0-4 Asian/NHPI 3448
## 6 2023-09-12 Bronx '0-4 Black 26957
## # ℹ 10 more variables: COUNT_PARTIALLY_CUMULATIVE <dbl>,
## # COUNT_FULLY_CUMULATIVE <dbl>, COUNT_1PLUS_CUMULATIVE <dbl>,
## # COUNT_ADDITIONAL_CUMULATIVE <chr>,
## # COUNT_BIVALENT_ADDITIONAL_CUMULATIVE <chr>, PERC_PARTIALLY <dbl>,
## # PERC_FULLY <dbl>, PERC_1PLUS <dbl>, PERC_ADDITIONAL <chr>,
## # PERC_BIVALENT_ADDITIONAL <chr>
Replace spaces with _ and convert column titles to lower cases
## # A tibble: 6 × 15
## date borough age_group race_ethnicity pop_denominator
## <date> <chr> <chr> <chr> <dbl>
## 1 2023-09-12 Citywide '0-4 Asian/NHPI 67072
## 2 2023-09-12 Citywide '0-4 Black 106526
## 3 2023-09-12 Citywide '0-4 Hispanic/Latino 175607
## 4 2023-09-12 Citywide '0-4 White 152540
## 5 2023-09-12 Bronx '0-4 Asian/NHPI 3448
## 6 2023-09-12 Bronx '0-4 Black 26957
## # ℹ 10 more variables: count_partially_cumulative <dbl>,
## # count_fully_cumulative <dbl>, count_1plus_cumulative <dbl>,
## # count_additional_cumulative <chr>,
## # count_bivalent_additional_cumulative <chr>, perc_partially <dbl>,
## # perc_fully <dbl>, perc_1plus <dbl>, perc_additional <chr>,
## # perc_bivalent_additional <chr>
## X.....date X..borough X.age_group
## X Min. :2023-09-12 Length:168 Length:168
## X.1 1st Qu.:2023-09-12 Class :character Class :character
## X.2 Median :2023-09-12 Mode :character Mode :character
## X.3 Mean :2023-09-12 <NA> <NA>
## X.4 3rd Qu.:2023-09-12 <NA> <NA>
## X.5 Max. :2023-09-12 <NA> <NA>
## race_ethnicity pop_denominator count_partially_cumulative
## X Length:168 Min. : 2575 Min. : 51
## X.1 Class :character 1st Qu.: 23269 1st Qu.: 1260
## X.2 Mode :character Median : 66249 Median : 4285
## X.3 <NA> Mean : 194406 Mean : 15799
## X.4 <NA> 3rd Qu.: 180638 3rd Qu.: 13898
## X.5 <NA> Max. :2681976 Max. :269918
## count_fully_cumulative count_1plus_cumulative count_additional_cumulative
## X Min. : 28 Min. : 79 Length:168
## X.1 1st Qu.: 10678 1st Qu.: 13025 Class :character
## X.2 Median : 44037 Median : 48853 Mode :character
## X.3 Mean : 143433 Mean : 159232 <NA>
## X.4 3rd Qu.: 146324 3rd Qu.: 156421 <NA>
## X.5 Max. :1841729 Max. :2111647 <NA>
## count_bivalent_additional_cumulative perc_partially X..perc_fully
## X Length:168 Min. : 1.010 Min. : 0.94
## X.1 Class :character 1st Qu.: 4.270 1st Qu.: 55.11
## X.2 Mode :character Median : 7.175 Median : 73.09
## X.3 <NA> Mean : 7.551 Mean : 68.39
## X.4 <NA> 3rd Qu.: 9.088 3rd Qu.: 89.75
## X.5 <NA> Max. :37.190 Max. :140.77
## X..perc_1plus perc_additional perc_bivalent_additional
## X Min. : 2.42 Length:168 Length:168
## X.1 1st Qu.: 62.45 Class :character Class :character
## X.2 Median : 78.31 Mode :character Mode :character
## X.3 Mean : 75.94 <NA> <NA>
## X.4 3rd Qu.: 99.78 <NA> <NA>
## X.5 Max. :155.95 <NA> <NA>
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
I created a function to find the correlation between the total of people with one dose, and those fully vaccinated
## [1] "Correlation = 0.9994"
I created a function for the m=linear model between the two columns that I’m working with.
##
## Call:
## lm(formula = count_fully_cumulative ~ count_1plus_cumulative,
## data = nyc_file)
##
## Residuals:
## Min 1Q Median 3Q Max
## -53032 -1455 -385 2156 45450
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.013e+02 7.926e+02 0.759 0.449
## count_1plus_cumulative 8.970e-01 2.329e-03 385.113 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9080 on 166 degrees of freedom
## Multiple R-squared: 0.9989, Adjusted R-squared: 0.9989
## F-statistic: 1.483e+05 on 1 and 166 DF, p-value: < 2.2e-16
We have a positive relationship between Fully Vaccinated and one Dose. The intercept is within the data range, meaning that the that fits well with the residual SE, in this dataset. Hypothesis Testing give us a strong relationship between fully vaccinated and one dose, we accept the Null Hypothesis since there is a strong relationship between the two variables. Multiple R-squared: 0.9989 Adjusted R-squared: 0.9989
## # A tibble: 1 × 3
## mean_dd median_dd n
## <dbl> <dbl> <int>
## 1 30266. 20710. 24
## # A tibble: 1 × 3
## mean_dd median_dd n
## <dbl> <dbl> <int>
## 1 175403. 54522. 42
## # A tibble: 7 × 3
## age_group median_dd iqr_rg
## <chr> <dbl> <dbl>
## 1 '0-4 1330 0
## 2 '13-17 38672 0
## 3 '18-44 253550 0
## 4 '45-64 147925 0
## 5 '5-12 38925 0
## 6 '65+ 57107 0
## 7 All ages 537509 0
I selected two variables to work with, count_fully_cumulative and count_1plus_cumulative, after I created a variable to check the correlation between the two variables I’ve notice that is going in the same direction, meaning that I have a positive correlation between them.
From the linear regression model, I accepted the H_0 : Null Hypothesis since the is a strong relationship between them.
Sources:
https://github.com/nychealth/covid-vaccine-data/tree/main/people#coverage-by-democsv
https://www.nyc.gov/site/doh/covid/covid-19-data-vaccines.page