In California, kindergarten is not compulsory like most other states; however, immunization is. Up to January 2016, parents were allowed to submit a personal beliefs exemption form in order not to immunize their children so they could attend school. This is no longer possible, since ex California Governor Brown signed into law Senate Bill (SB) 277.
Vaccination (not immunization) is currently a hot topic, and some are arguing that mandating a vaccine, violates individual rights and exceeds a government authority. A serious argument, right? But what about community health?
Now then, the following dataset, from the California Department of Public Health, is about kindergarten registration in the state of California and the quantity of children who were completely immunized during that process. The dataset is comprised of 6 categorical and 2 quantitative variables and the records are from 2001 thru 2015. Fortunately, the dataset is clean to begin with. Below is a list of all 8 variables and what they represent:
district: School district
sch_code: School code (Each school has there own unique code)
county: The county where the school is located
pub_priv: Is it private or public school
school: The name of the school
enrollment: Number of children registered
complete: “Number of children with complete immunizations”
start_year: “Year of entry (for the 2015-2016 school year, for example, this would be 2015)”
Read the dataset into Rstudio and show first few rows
#Set working directory
setwd("~/Data110")
#Read data into Rstudio
kindergaten <- read.csv("kindergarten_CA.csv")
# let's Check out the first few lines
head(kindergaten)
## district sch_code county pub_priv school enrollment
## 1 Alameda Unified 6967434 Alameda Private ALAMEDA CHRTN 12
## 2 Alameda Unified 6110779 Alameda Public BAY FARM ELEM 78
## 3 Alameda Unified 6100374 Alameda Public EARHART (AMELIA) ELEM 77
## 4 Alameda Unified 6090013 Alameda Public EDISON ELEM 56
## 5 Alameda Unified 6090039 Alameda Public FRANKLIN ELEM 41
## 6 Alameda Unified 6090047 Alameda Public HAIGHT ELEM 75
## complete start_year
## 1 11 2001
## 2 77 2001
## 3 73 2001
## 4 53 2001
## 5 41 2001
## 6 65 2001
Load Libraries
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.4 v dplyr 1.0.7
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 2.0.1 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(dplyr)
library(highcharter)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
## Highcharts (www.highcharts.com) is a Highsoft software product which is
## not free for commercial and Governmental use
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library('rvest')
##
## Attaching package: 'rvest'
## The following object is masked from 'package:readr':
##
## guess_encoding
library('scales')
##
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
##
## discard
## The following object is masked from 'package:readr':
##
## col_factor
library(viridis)
## Loading required package: viridisLite
##
## Attaching package: 'viridis'
## The following object is masked from 'package:scales':
##
## viridis_pal
library(ggplot2)
Let’s create a new data frame four our statistical analysis
Let’s add a new column to our dataset and a filter
Since the data is so vast, I randomly picking one county to work with
kfilter <-filter(kindergaten, county %in% c("Los Angeles" )) %>%
filter(enrollment >= 1)%>%
mutate(im_prop = complete/enrollment)
#view(kfilter)
Create a simple plot
kindplot <- kfilter %>%
ggplot(aes(enrollment, complete))+
labs(title = "Los Angeles County Kindergarten - Immunization Completion Rates",
caption="Source: California Department of Public Health ")+
xlab("Enrolment")+
ylab("Children With Complete Immunization")+
geom_point()+
theme_minimal()
kindplot <- ggplotly(kindplot)
kindplot
About the outliers
The 4 dots on the right represent “Miles Ave. Elementary school” from 2001-2004, located in Huntinton Park, CA.
Business Insider (BI), ranked that city lowest in CA and 10th-worst nationally on a so-called “misery index”
Let’s add a linear regression to our chart
Let’s also divide our graph into two sections: Public and Private schools
Let’s remove these outliers and replot
kfilter1 <-filter(kindergaten, county %in% c("Los Angeles" )) %>%
filter(enrollment >= 1 & enrollment < 350)%>%
mutate(im_prop = complete/enrollment)
#view(kfilter1)
New Plot
kindplot2 <- kfilter1 %>%
ggplot(aes(enrollment, complete, color=pub_priv))+
labs(title = "Los Angeles County Kindergarten - Immunization and Enrollment",
caption="Source: California Department of Public Health ")+
xlab("Enrolment")+
ylab("Children With Complete Immunization")+
geom_point()+
facet_wrap(~pub_priv)+
geom_smooth(method = 'lm', formula =y~x, color="red")+
theme_minimal()+
scale_color_discrete(name = " ")
kindplot2 <- ggplotly(kindplot2)
kindplot2
Create a few more objects for more Statistical Analysis
kfilterP <-filter(kindergaten, county %in% c("Los Angeles" ), pub_priv %in% c("Private") ) %>%
filter(enrollment < 350)%>%
mutate(im_prop = complete/enrollment)
view(kfilterP)
kfilterPu <-filter(kindergaten, county %in% c("Los Angeles" ), pub_priv %in% c("Public") ) %>%
filter(enrollment < 350)%>%
mutate(im_prop = complete/enrollment)
#view(kfilterPu)
Let’s calculate the correlation coefficient and find the linear regression model
# Correlation coefficient for both private and public kindergatens in Los Angeles county.
cor(kfilter1$enrollment, kfilter1$complete)
## [1] 0.9679528
# Correlation coefficient for private kindergatens in Los Angeles county.
cor(kfilterP$enrollment, kfilterP$complete)
## [1] 0.9578562
# Correlation coefficient for public kindergatens in Los Angeles county.
cor(kfilterPu$enrollment, kfilterPu$complete)
## [1] 0.9459023
kindfit <- lm(complete ~ enrollment, data =kfilter1)
summary(kindfit)
##
## Call:
## lm(formula = complete ~ enrollment, data = kfilter1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -167.366 -1.590 2.020 5.592 29.973
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.058683 0.125080 0.469 0.639
## enrollment 0.896083 0.001413 634.237 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.51 on 27077 degrees of freedom
## Multiple R-squared: 0.9369, Adjusted R-squared: 0.9369
## F-statistic: 4.023e+05 on 1 and 27077 DF, p-value: < 2.2e-16
Base on these calculations our model has the following equation: complete = 0.896(enrollment)+0.058
We also have a very small p-value of 2.2e-16 and an Ajusted R-Square value suggesting that about 94% of the variation in the observations maybe explained by our model.
Let’s create a new object that we will use for our final visualization
kinselect <- kindergaten %>%
select(county, enrollment, complete, pub_priv, start_year)%>%
mutate(im_prop=complete/enrollment)%>%
na.omit()%>%
group_by(county)%>%
summarise(percentage= mean(im_prop)*100)
Side note: Unfortunately, I was unable to use the Webscrapping method to Read this data into R since it’s an online PDF. SelectorGadget does not work on pdfs. As a result, I ended up converting the PDF file to and excel file and then saving it as a csv file.
levels<-read_csv("CaPovertyLevels.csv")
## New names:
## * `` -> ...3
## * `` -> ...4
## * `` -> ...5
## Rows: 58 Columns: 5
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (1): county
## dbl (1): p_below
## lgl (3): ...3, ...4, ...5
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(levels)
## # A tibble: 6 x 5
## county p_below ...3 ...4 ...5
## <chr> <dbl> <lgl> <lgl> <lgl>
## 1 Alameda 12 NA NA NA
## 2 Alpine 18.9 NA NA NA
## 3 Amador 11.2 NA NA NA
## 4 Butte 21.3 NA NA NA
## 5 Calaveras 12.7 NA NA NA
## 6 Colusa 13.5 NA NA NA
This dataset only has 1 quantity variable and one categorical variable:
County: The CA county’s name
p_below: Percent Below Property Level
Note, according to the Census Bureau, a poverty area "is typically defined when 20%
or more of the population lives below poverty level."
Our objectif is to merge the two data frames (Levels and kinselect)
and then create a scatterplot to map the immunization rates against the level of poverty
Perhaps, we will be able to answer the following question:
How does household income influence immunization decisions?
First let’s clean the “levels” dataset by removing the N/A columns.
levels <- levels[, colSums(is.na(levels)) < nrow(levels)]
Now, let’s join the two datasets
datajoin <- inner_join(kinselect, levels, by= "county")%>%
arrange(county)
Our final visualization
#cols <- inferno(256, alpha = 1, begin = 0, end = 1, direction = -1)
highchart()%>%
hc_add_series(name="County", data = datajoin, type="bubble",hcaes(x=p_below, y=percentage))%>%
#hc_add_series(data = datajoin, type="line",hcaes(x=p_below, y=percentage))%>%
hc_title(text="Poverty level and Immunization per County",style = list(color = "#2b908f", fontWeight = "bold"))%>%
hc_subtitle(text="California Kingergaten Enrollment (2001-2015)",style = list(color = "#2b908f", fontWeight = "bold"))%>%
hc_xAxis(title=list(text="Poverty Levels", style = list(color = "#980000",fontWeight = "bold")))%>%
hc_yAxis(title=list(text="Percentage of Complete Immunization", style = list(color = "#980000",fontWeight = "bold")))%>%
hc_legend(align = "right", verticalAlign="top")%>%
hc_tooltip(shared=TRUE, borderColor = "black",pointFormat= "{point.county} <br> Level: {point.x} <br> Immunization Rate: {point.y:.2f}<br>")
# hc_colors(cols)
Correlation coefficient for the above visualization
cor(datajoin$p_below, datajoin$percentage)
## [1] 0.05018824
The correlation coefficient for our last visualization is very weak, only 0.05. Base on our data, We can now conclude that immunization attitudes are not influenced by households’ income. However, we also need to point out that our datasets have some weaknesses: the poverty level dataset is current while the kindergarten dataset is very old.
Perhaps, immunization acceptance is a personal choice. A few recent studies seem to support that argument. For example, when it comes to covid-19 vaccines, Arce, Warren and Meriggi state that “Vaccine acceptance in LMICs is primarily explained by an interest in personal protection against COVID-19, while concern about side effects is the most common reason for hesitancy.”
Solís Arce, J.S., Warren, S.S., Meriggi, N.F. et al. COVID-19 vaccine acceptance and hesitancy in low- and middle-income countries. Nat Med 27, 1385–1394 (2021).
Vocabulary: low- and middle-income countries (LMICs)