Introduction

I decided to work with a dataset that contains NYC vaccination rates by age, gender, borough, etc. This data contains 168 observations and 15 columns with rich data for this project.

Research Question and objective of the study

Relationship of age group and vaccination rate in NYC got the COVID 19 vaccine until 2023? Correlation between age group, etchnicity, fully vaccinated, and people with first dose. The main objective is to create several variables to study the relationship between several aspects of this dataset, find statistical values in columns, and combine columns values to answer my research questions above.

Data Source

I’m going to work with data New York City DOH Website: https://www.nyc.gov/site/doh/covid/covid-19-data-vaccines.page This dataset provides data about vaccinations rate by age, borough, ethnicity, etc.

Type of Study

This is a research and observational study, since the city of New York collected the data based on data from hospitals and vaccination sites throughout the city.

Libraries requiered for the project

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## ------------------------------------------------------------------------------
## 
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## 
## ------------------------------------------------------------------------------
## 
## 
## Attaching package: 'plyr'
## 
## 
## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
## 
## 
## The following object is masked from 'package:purrr':
## 
##     compact
## 
## 
## 
## Attaching package: 'plotly'
## 
## 
## The following objects are masked from 'package:plyr':
## 
##     arrange, mutate, rename, summarise
## 
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## 
## The following object is masked from 'package:graphics':
## 
##     layout
## 
## 
## 
## Attaching package: 'RCurl'
## 
## 
## The following object is masked from 'package:tidyr':
## 
##     complete

## Warning: package 'tidyselect' was built under R version 4.3.3

## 
## Attaching package: 'data.table'
## 
## The following objects are masked from 'package:lubridate':
## 
##     hour, isoweek, mday, minute, month, quarter, second, wday, week,
##     yday, year
## 
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
## 
## The following object is masked from 'package:purrr':
## 
##     transpose
## 
## 
## Attaching package: 'MASS'
## 
## The following object is masked from 'package:plotly':
## 
##     select
## 
## The following object is masked from 'package:dplyr':
## 
##     select

Load data from NYC DOH GitHub Account

## Rows: 168 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): BOROUGH, AGE_GROUP, RACE_ETHNICITY, COUNT_ADDITIONAL_CUMULATIVE, C...
## dbl  (7): POP_DENOMINATOR, COUNT_PARTIALLY_CUMULATIVE, COUNT_FULLY_CUMULATIV...
## date (1): DATE
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

## # A tibble: 6 × 15
##   DATE       BOROUGH  AGE_GROUP RACE_ETHNICITY  POP_DENOMINATOR
##   <date>     <chr>    <chr>     <chr>                     <dbl>
## 1 2023-09-12 Citywide '0-4      Asian/NHPI                67072
## 2 2023-09-12 Citywide '0-4      Black                    106526
## 3 2023-09-12 Citywide '0-4      Hispanic/Latino          175607
## 4 2023-09-12 Citywide '0-4      White                    152540
## 5 2023-09-12 Bronx    '0-4      Asian/NHPI                 3448
## 6 2023-09-12 Bronx    '0-4      Black                     26957
## # ℹ 10 more variables: COUNT_PARTIALLY_CUMULATIVE <dbl>,
## #   COUNT_FULLY_CUMULATIVE <dbl>, COUNT_1PLUS_CUMULATIVE <dbl>,
## #   COUNT_ADDITIONAL_CUMULATIVE <chr>,
## #   COUNT_BIVALENT_ADDITIONAL_CUMULATIVE <chr>, PERC_PARTIALLY <dbl>,
## #   PERC_FULLY <dbl>, PERC_1PLUS <dbl>, PERC_ADDITIONAL <chr>,
## #   PERC_BIVALENT_ADDITIONAL <chr>

Tyding the data

Replace spaces with _ and convert column titles to lower cases

## # A tibble: 6 × 15
##   date       borough  age_group race_ethnicity  pop_denominator
##   <date>     <chr>    <chr>     <chr>                     <dbl>
## 1 2023-09-12 Citywide '0-4      Asian/NHPI                67072
## 2 2023-09-12 Citywide '0-4      Black                    106526
## 3 2023-09-12 Citywide '0-4      Hispanic/Latino          175607
## 4 2023-09-12 Citywide '0-4      White                    152540
## 5 2023-09-12 Bronx    '0-4      Asian/NHPI                 3448
## 6 2023-09-12 Bronx    '0-4      Black                     26957
## # ℹ 10 more variables: count_partially_cumulative <dbl>,
## #   count_fully_cumulative <dbl>, count_1plus_cumulative <dbl>,
## #   count_additional_cumulative <chr>,
## #   count_bivalent_additional_cumulative <chr>, perc_partially <dbl>,
## #   perc_fully <dbl>, perc_1plus <dbl>, perc_additional <chr>,
## #   perc_bivalent_additional <chr>

Get summary of dataframe

##               X.....date         X..borough        X.age_group
## X   Min.   :2023-09-12   Length:168         Length:168        
## X.1 1st Qu.:2023-09-12   Class :character   Class :character  
## X.2 Median :2023-09-12   Mode  :character   Mode  :character  
## X.3 Mean   :2023-09-12                 <NA>               <NA>
## X.4 3rd Qu.:2023-09-12                 <NA>               <NA>
## X.5 Max.   :2023-09-12                 <NA>               <NA>
##         race_ethnicity   pop_denominator count_partially_cumulative
## X   Length:168         Min.   :   2575             Min.   :    51  
## X.1 Class :character   1st Qu.:  23269             1st Qu.:  1260  
## X.2 Mode  :character   Median :  66249             Median :  4285  
## X.3               <NA> Mean   : 194406             Mean   : 15799  
## X.4               <NA> 3rd Qu.: 180638             3rd Qu.: 13898  
## X.5               <NA> Max.   :2681976             Max.   :269918  
##     count_fully_cumulative count_1plus_cumulative count_additional_cumulative
## X        Min.   :     28        Min.   :     79            Length:168        
## X.1      1st Qu.:  10678        1st Qu.:  13025            Class :character  
## X.2      Median :  44037        Median :  48853            Mode  :character  
## X.3      Mean   : 143433        Mean   : 159232                          <NA>
## X.4      3rd Qu.: 146324        3rd Qu.: 156421                          <NA>
## X.5      Max.   :1841729        Max.   :2111647                          <NA>
##     count_bivalent_additional_cumulative   perc_partially    X..perc_fully
## X                     Length:168         Min.   : 1.010   Min.   :  0.94  
## X.1                   Class :character   1st Qu.: 4.270   1st Qu.: 55.11  
## X.2                   Mode  :character   Median : 7.175   Median : 73.09  
## X.3                                 <NA> Mean   : 7.551   Mean   : 68.39  
## X.4                                 <NA> 3rd Qu.: 9.088   3rd Qu.: 89.75  
## X.5                                 <NA> Max.   :37.190   Max.   :140.77  
##        X..perc_1plus    perc_additional perc_bivalent_additional
## X   Min.   :  2.42   Length:168               Length:168        
## X.1 1st Qu.: 62.45   Class :character         Class :character  
## X.2 Median : 78.31   Mode  :character         Mode  :character  
## X.3 Mean   : 75.94                 <NA>                     <NA>
## X.4 3rd Qu.: 99.78                 <NA>                     <NA>
## X.5 Max.   :155.95                 <NA>                     <NA>

Plot of people with one dose by age

Plot of Group Age with one dose of vaccine with a 95% CI

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Histogram of cumulative data of people with only one dose.

Density plot of people with only one dose

Plot of fully vaccinated by race

Histogram of people fully vaccinated

Densityplot of people fully vaccinated

Correlation Function

I created a function to find the correlation between the total of people with one dose, and those fully vaccinated

## [1] "Correlation = 0.9994"

Linear model function

I created a function for the m=linear model between the two columns that I’m working with.

## 
## Call:
## lm(formula = count_fully_cumulative ~ count_1plus_cumulative, 
##     data = nyc_file)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -53032  -1455   -385   2156  45450 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            6.013e+02  7.926e+02   0.759    0.449    
## count_1plus_cumulative 8.970e-01  2.329e-03 385.113   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9080 on 166 degrees of freedom
## Multiple R-squared:  0.9989, Adjusted R-squared:  0.9989 
## F-statistic: 1.483e+05 on 1 and 166 DF,  p-value: < 2.2e-16

count_fully_cumulative = 60.13 + (8.970e -1)(count_1plus_cumulative )

Plot linear model using ploty.

Linear Regression Results

We have a positive relationship between Fully Vaccinated and one Dose. The intercept is within the data range, meaning that the that fits well with the residual SE, in this dataset. Hypothesis Testing give us a strong relationship between fully vaccinated and one dose, we accept the Null Hypothesis since there is a strong relationship between the two variables. Multiple R-squared: 0.9989 Adjusted R-squared: 0.9989

Filter data to get people fully vaccinated age 5-12

Summary of age_group column fully vaccinated

## # A tibble: 1 × 3
##   mean_dd median_dd     n
##     <dbl>     <dbl> <int>
## 1  30266.    20710.    24

Create a function to find Hispanic/latino rate fully vaccinated

Summary of latino function

## # A tibble: 1 × 3
##   mean_dd median_dd     n
##     <dbl>     <dbl> <int>
## 1 175403.    54522.    42

Filter data to get stats of Hispanic people in Queens

Summary of function boro_var

## # A tibble: 7 × 3
##   age_group median_dd iqr_rg
##   <chr>         <dbl>  <dbl>
## 1 '0-4           1330      0
## 2 '13-17        38672      0
## 3 '18-44       253550      0
## 4 '45-64       147925      0
## 5 '5-12         38925      0
## 6 '65+          57107      0
## 7 All ages     537509      0

Plot of young age groups fully vaccinated

Conclusion:

I selected two variables to work with, count_fully_cumulative and count_1plus_cumulative, after I created a variable to check the correlation between the two variables I’ve notice that is going in the same direction, meaning that I have a positive correlation between them.

From the linear regression model, I accepted the H_0 : Null Hypothesis since the is a strong relationship between them.

Sources:

https://github.com/nychealth/covid-vaccine-data/tree/main/people#coverage-by-democsv

https://www.nyc.gov/site/doh/covid/covid-19-data-vaccines.page

Final Project Presentation