In California, kindergarten is not compulsory like most other states; however, immunization is. Up to January 2016, parents were allowed to submit a personal beliefs exemption form in order not to immunize their children so they could attend school. This is no longer possible, since ex California Governor Brown signed into law Senate Bill (SB) 277.

Vaccination (not immunization) is currently a hot topic, and some are arguing that mandating a vaccine, violates individual rights and exceeds a government authority. A serious argument, right? But what about community health?

Now then, the following dataset, from the California Department of Public Health, is about kindergarten registration in the state of California and the quantity of children who were completely immunized during that process. The dataset is comprised of 6 categorical and 2 quantitative variables and the records are from 2001 thru 2015. Fortunately, the dataset is clean to begin with. Below is a list of all 8 variables and what they represent:

district: School district

sch_code: School code (Each school has there own unique code)

county: The county where the school is located

pub_priv: Is it private or public school

school: The name of the school

enrollment: Number of children registered

complete: “Number of children with complete immunizations”

start_year: “Year of entry (for the 2015-2016 school year, for example, this would be 2015)”

Read the dataset into Rstudio and show first few rows

#Set working directory
setwd("~/Data110")

#Read data into Rstudio
kindergaten <- read.csv("kindergarten_CA.csv")

# let's Check out the first few lines
head(kindergaten)
##          district sch_code  county pub_priv                school enrollment
## 1 Alameda Unified  6967434 Alameda  Private         ALAMEDA CHRTN         12
## 2 Alameda Unified  6110779 Alameda   Public         BAY FARM ELEM         78
## 3 Alameda Unified  6100374 Alameda   Public EARHART (AMELIA) ELEM         77
## 4 Alameda Unified  6090013 Alameda   Public           EDISON ELEM         56
## 5 Alameda Unified  6090039 Alameda   Public         FRANKLIN ELEM         41
## 6 Alameda Unified  6090047 Alameda   Public           HAIGHT ELEM         75
##   complete start_year
## 1       11       2001
## 2       77       2001
## 3       73       2001
## 4       53       2001
## 5       41       2001
## 6       65       2001

Load Libraries

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.4     v dplyr   1.0.7
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   2.0.1     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(dplyr)
library(highcharter)
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
## Highcharts (www.highcharts.com) is a Highsoft software product which is
## not free for commercial and Governmental use
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
library('rvest')
## 
## Attaching package: 'rvest'
## The following object is masked from 'package:readr':
## 
##     guess_encoding
library('scales')
## 
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
## 
##     discard
## The following object is masked from 'package:readr':
## 
##     col_factor
library(viridis)
## Loading required package: viridisLite
## 
## Attaching package: 'viridis'
## The following object is masked from 'package:scales':
## 
##     viridis_pal
library(ggplot2)

Let’s create a new data frame four our statistical analysis

Let’s add a new column to our dataset and a filter

Since the data is so vast, I randomly picking one county to work with

kfilter <-filter(kindergaten, county %in% c("Los Angeles" )) %>%
  filter(enrollment >= 1)%>%
  mutate(im_prop = complete/enrollment)
#view(kfilter)

Create a simple plot

kindplot <- kfilter %>%
  ggplot(aes(enrollment, complete))+
  labs(title = "Los Angeles County Kindergarten - Immunization Completion Rates",
  caption="Source: California Department of Public Health ")+
  xlab("Enrolment")+
  ylab("Children With Complete Immunization")+
  geom_point()+
  theme_minimal()
kindplot <- ggplotly(kindplot)
kindplot

About the outliers

The 4 dots on the right represent “Miles Ave. Elementary school” from 2001-2004, located in Huntinton Park, CA.

Business Insider (BI), ranked that city lowest in CA and 10th-worst nationally on a so-called “misery index”

Let’s add a linear regression to our chart

Let’s also divide our graph into two sections: Public and Private schools

Let’s remove these outliers and replot

kfilter1 <-filter(kindergaten, county %in% c("Los Angeles" )) %>%
  filter(enrollment >= 1 & enrollment < 350)%>%
  mutate(im_prop = complete/enrollment)
#view(kfilter1)

New Plot

kindplot2 <- kfilter1 %>%
  ggplot(aes(enrollment, complete, color=pub_priv))+
  labs(title = "Los Angeles County Kindergarten - Immunization and Enrollment",
  caption="Source: California Department of Public Health ")+
  xlab("Enrolment")+
  ylab("Children With Complete Immunization")+
  geom_point()+
  facet_wrap(~pub_priv)+
  geom_smooth(method = 'lm', formula =y~x, color="red")+
  theme_minimal()+
  scale_color_discrete(name =  " ")
kindplot2 <- ggplotly(kindplot2)
kindplot2

Create a few more objects for more Statistical Analysis

kfilterP <-filter(kindergaten, county %in% c("Los Angeles" ), pub_priv %in% c("Private") ) %>%
  filter(enrollment < 350)%>%
  mutate(im_prop = complete/enrollment)
view(kfilterP)
kfilterPu <-filter(kindergaten, county %in% c("Los Angeles" ), pub_priv %in% c("Public") ) %>%
  filter(enrollment < 350)%>%
  mutate(im_prop = complete/enrollment)
#view(kfilterPu)

Let’s calculate the correlation coefficient and find the linear regression model

# Correlation coefficient for both private and public kindergatens in Los Angeles county.
cor(kfilter1$enrollment, kfilter1$complete)
## [1] 0.9679528
# Correlation coefficient for private kindergatens in Los Angeles county.
cor(kfilterP$enrollment, kfilterP$complete)
## [1] 0.9578562
# Correlation coefficient for public kindergatens in Los Angeles county.
cor(kfilterPu$enrollment, kfilterPu$complete)
## [1] 0.9459023
kindfit <- lm(complete ~ enrollment, data =kfilter1)
summary(kindfit)
## 
## Call:
## lm(formula = complete ~ enrollment, data = kfilter1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -167.366   -1.590    2.020    5.592   29.973 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.058683   0.125080   0.469    0.639    
## enrollment  0.896083   0.001413 634.237   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.51 on 27077 degrees of freedom
## Multiple R-squared:  0.9369, Adjusted R-squared:  0.9369 
## F-statistic: 4.023e+05 on 1 and 27077 DF,  p-value: < 2.2e-16

Base on these calculations our model has the following equation: complete = 0.896(enrollment)+0.058

We also have a very small p-value of 2.2e-16 and an Ajusted R-Square value suggesting that about 94% of the variation in the observations maybe explained by our model.

Let’s create a new object that we will use for our final visualization

kinselect <- kindergaten %>%
  select(county, enrollment, complete, pub_priv, start_year)%>%
  mutate(im_prop=complete/enrollment)%>%
  na.omit()%>%
  group_by(county)%>%
  summarise(percentage= mean(im_prop)*100)

Let’s bring another dataset from the California SCDD (State Council on Developmental Disabilities) https://scdd.ca.gov/wp-content/uploads/sites/33/2019/03/Exhbit-A-SCDD-California-Poverty-Levels-by-County.pdf

Side note: Unfortunately, I was unable to use the Webscrapping method to Read this data into R since it’s an online PDF. SelectorGadget does not work on pdfs. As a result, I ended up converting the PDF file to and excel file and then saving it as a csv file.

levels<-read_csv("CaPovertyLevels.csv")
## New names:
## * `` -> ...3
## * `` -> ...4
## * `` -> ...5
## Rows: 58 Columns: 5
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (1): county
## dbl (1): p_below
## lgl (3): ...3, ...4, ...5
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(levels)
## # A tibble: 6 x 5
##   county    p_below ...3  ...4  ...5 
##   <chr>       <dbl> <lgl> <lgl> <lgl>
## 1 Alameda      12   NA    NA    NA   
## 2 Alpine       18.9 NA    NA    NA   
## 3 Amador       11.2 NA    NA    NA   
## 4 Butte        21.3 NA    NA    NA   
## 5 Calaveras    12.7 NA    NA    NA   
## 6 Colusa       13.5 NA    NA    NA

This dataset only has 1 quantity variable and one categorical variable:

County: The CA county’s name

p_below: Percent Below Property Level

Note, according to the Census Bureau, a poverty area "is typically defined when 20%

or more of the population lives below poverty level."

Our objectif is to merge the two data frames (Levels and kinselect)

and then create a scatterplot to map the immunization rates against the level of poverty

Perhaps, we will be able to answer the following question:

How does household income influence immunization decisions?

First let’s clean the “levels” dataset by removing the N/A columns.

levels <- levels[, colSums(is.na(levels)) < nrow(levels)]

Now, let’s join the two datasets

datajoin <- inner_join(kinselect, levels, by= "county")%>%
  arrange(county)

Our final visualization

#cols <- inferno(256, alpha = 1, begin = 0, end = 1, direction = -1)

highchart()%>%
  hc_add_series(name="County", data = datajoin, type="bubble",hcaes(x=p_below, y=percentage))%>%
  #hc_add_series(data = datajoin, type="line",hcaes(x=p_below, y=percentage))%>%
  hc_title(text="Poverty level and Immunization per County",style = list(color = "#2b908f", fontWeight = "bold"))%>%
  hc_subtitle(text="California Kingergaten Enrollment (2001-2015)",style = list(color = "#2b908f", fontWeight = "bold"))%>%
  hc_xAxis(title=list(text="Poverty Levels", style = list(color = "#980000",fontWeight = "bold")))%>%
  hc_yAxis(title=list(text="Percentage of Complete Immunization", style = list(color = "#980000",fontWeight = "bold")))%>%
  hc_legend(align = "right", verticalAlign="top")%>%
  hc_tooltip(shared=TRUE, borderColor = "black",pointFormat= "{point.county} <br> Level: {point.x} <br> Immunization Rate: {point.y:.2f}<br>")
 # hc_colors(cols)

Correlation coefficient for the above visualization

cor(datajoin$p_below, datajoin$percentage)
## [1] 0.05018824

The correlation coefficient for our last visualization is very weak, only 0.05. Base on our data, We can now conclude that immunization attitudes are not influenced by households’ income. However, we also need to point out that our datasets have some weaknesses: the poverty level dataset is current while the kindergarten dataset is very old.

Perhaps, immunization acceptance is a personal choice. A few recent studies seem to support that argument. For example, when it comes to covid-19 vaccines, Arce, Warren and Meriggi state that “Vaccine acceptance in LMICs is primarily explained by an interest in personal protection against COVID-19, while concern about side effects is the most common reason for hesitancy.”

Solís Arce, J.S., Warren, S.S., Meriggi, N.F. et al. COVID-19 vaccine acceptance and hesitancy in low- and middle-income countries. Nat Med 27, 1385–1394 (2021).

Vocabulary: low- and middle-income countries (LMICs)