Dataset of enrollment and the number of children with completed immunizations at kindergartens in 2001-2015, CA.

Loading libraries

library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.0.5
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3     v purrr   0.3.4
## v tibble  3.0.6     v dplyr   1.0.3
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(tinytex)
library(dplyr)
library(ggplot2)
library(plotly)
## Warning: package 'plotly' was built under R version 4.0.4
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
library(highcharter)
## Warning: package 'highcharter' was built under R version 4.0.5
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
library(RColorBrewer)

Loading the CSV file and viewing the dataset

kinder <- read_csv("kindergarten_CA.csv")
## 
## -- Column specification --------------------------------------------------------
## cols(
##   district = col_character(),
##   sch_code = col_double(),
##   county = col_character(),
##   pub_priv = col_character(),
##   school = col_character(),
##   enrollment = col_double(),
##   complete = col_double(),
##   start_year = col_double()
## )
#view(kinder)

Checking the structure of data

As we can see, there are 8 variables and 110,382 rows in each of the variables. The names of the variables are applicable for analysis. There are 4 numerical discrete and 4 character/categorical variables in the dataset.

str(kinder)
## spec_tbl_df [110,382 x 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ district  : chr [1:110382] "Alameda Unified" "Alameda Unified" "Alameda Unified" "Alameda Unified" ...
##  $ sch_code  : num [1:110382] 6967434 6110779 6100374 6090013 6090039 ...
##  $ county    : chr [1:110382] "Alameda" "Alameda" "Alameda" "Alameda" ...
##  $ pub_priv  : chr [1:110382] "Private" "Public" "Public" "Public" ...
##  $ school    : chr [1:110382] "ALAMEDA CHRTN" "BAY FARM ELEM" "EARHART (AMELIA) ELEM" "EDISON ELEM" ...
##  $ enrollment: num [1:110382] 12 78 77 56 41 75 40 80 61 49 ...
##  $ complete  : num [1:110382] 11 77 73 53 41 65 34 76 61 43 ...
##  $ start_year: num [1:110382] 2001 2001 2001 2001 2001 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   district = col_character(),
##   ..   sch_code = col_double(),
##   ..   county = col_character(),
##   ..   pub_priv = col_character(),
##   ..   school = col_character(),
##   ..   enrollment = col_double(),
##   ..   complete = col_double(),
##   ..   start_year = col_double()
##   .. )

Removing all NA’s

kinder1 <- na.omit(kinder)

Look at the summary of the dataset

The reason for checking the summary is getting the main statistical characteristics of numerical variables.

summary(kinder1)
##    district            sch_code          county            pub_priv        
##  Length:108730      Min.   :   1501   Length:108730      Length:108730     
##  Class :character   1st Qu.:6019863   Class :character   Class :character  
##  Mode  :character   Median :6048201   Mode  :character   Mode  :character  
##                     Mean   :5884901                                        
##                     3rd Qu.:6120430                                        
##                     Max.   :9999999                                        
##     school            enrollment        complete        start_year  
##  Length:108730      Min.   : 10.00   Min.   :  0.00   Min.   :2001  
##  Class :character   1st Qu.: 34.00   1st Qu.: 29.00   1st Qu.:2004  
##  Mode  :character   Median : 68.00   Median : 61.00   Median :2008  
##                     Mean   : 70.77   Mean   : 64.89   Mean   :2008  
##                     3rd Qu.: 98.00   3rd Qu.: 91.00   3rd Qu.:2012  
##                     Max.   :981.00   Max.   :973.00   Max.   :2015

Adding a new variable in the dataset

Since there are two types of schools in the dataset (private and public), it might be interesting to explore the rate of immunization in these types of schools.

I added a new variable - a ratio of children that have been vaccinated to the total number of enrolled children.

kinder1 <- mutate(kinder1, percentage = complete / enrollment * 100)

#view(kinder1)

Making the Boxplot

This boxplot shows the percentage of completed immunizations by types of schools. The medians for both types of schools are almost the same, around 95%.

However, the lower fence (58%) and Q1 (83%) are lower in private schools that in public schools (76% and 89% respectively).

Also, the Interquartile Range (IQR = 17%) is wider for private schools than for public schools (9%).

P1 <- plot_ly(kinder1, x = ~pub_priv, y = ~percentage, color = ~pub_priv, colors = "Set1", type = "box") %>% 
  layout(title = "Percentage of Children Immunizations by School Type", xaxis = list(title = "Type of School"), yaxis = list(title = "Percentage"))
P1

Making correlation plot

The plot shows the relationship of two variables - the number of enrollment in schools and the number of children who have completed immunizations. The visualization is divided into two graphs by type of school.

On the graphs we can observe strong positive correlation between enrollment and completed immunizations in both private and public schools.

p2 <- ggplot(kinder1, aes(enrollment, complete, colour = factor(pub_priv))) +
  geom_point (alpha = 0.3, size = 1.3)+ 
  scale_colour_manual(values=c('#FF0000', '#4682B4'), # set colors for plot lines
                       name="Type of school", 
                       breaks=c("Private", "Public"),
                       labels=c("Private", "Public"))+ # set legend 
  facet_wrap(.~ pub_priv) +
  geom_smooth(method='lm',formula=y~x, size = 0.3, color = "black")+ # add linear regression
  ggtitle("Enrollment vs. Complete Immunizations", subtitle = 'By School Type')+
  theme_bw()
p2

#ggplotly(p2) # may be applied for more interactivity on the graphs

The correlation coefficient and statistacal summary.

The correlation coefficient is 0.98, which means strong positive correlation.

The model has the equation: complete = 0.9320(enrollment) - 1.0673

The Adjusted R-Squared value (0.9581) states that about 96% of the variation in the observations may be explained by this model.

p-value (< 2.2e-16) is significantly small, so the predictor (enrollment) is useful to the model.

cor(kinder1$enrollment, kinder1$complete) # correlation coefficient
## [1] 0.9788065
fit <- lm(complete ~ enrollment, data = kinder1) # statistacal summary
summary(fit)
## 
## Call:
## lm(formula = complete ~ enrollment, data = kinder1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -227.382   -1.486    1.933    4.166   59.753 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.0672596  0.0492995  -21.65   <2e-16 ***
## enrollment   0.9320223  0.0005914 1576.03   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.591 on 108728 degrees of freedom
## Multiple R-squared:  0.9581, Adjusted R-squared:  0.9581 
## F-statistic: 2.484e+06 on 1 and 108728 DF,  p-value: < 2.2e-16

Making heatmap plot

The plot shows the percentage of complete immunizations by county. The visualization is divided into two graphs by type of school.

The heatmap plot shows that for private schools there are more cases with a low rate of immunizations. Also, for private schools there is more missing data in the report.

kinder1 %>%
  mutate(county = reorder(county, percentage)) %>% # reorder counties by the means of percentage
  ggplot(aes(start_year, county,  fill = percentage)) +
  geom_tile(color = "white") +
  scale_x_continuous(breaks = c(2001,2003,2005,2007,2009,2011,2013,2015)) +
  scale_fill_gradientn(colors = brewer.pal(5, "RdYlGn")) +
  theme_classic() +
  theme(panel.grid = element_blank()) +
  ggtitle('Percentage of Children with Immunizations', subtitle = 'By School Type') +
  ylab("Counties") +
  xlab("Years") +
  facet_grid(.~ pub_priv)

Creating a new dataframe for private schools

The new dataset ‘kinder_priv’ contains data only for private schools. This data is needed to visualize two general trends by years in private schools.

kinder_priv <- kinder1 %>%
  filter(pub_priv %in% c("Private")) %>% # set dataset with privet schools
  select(enrollment, complete, start_year) %>%
  group_by(start_year) %>%
  summarize(total_enroll = sum(enrollment), total_immun = sum(complete)) 
kinder_priv
## # A tibble: 15 x 3
##    start_year total_enroll total_immun
##  *      <dbl>        <dbl>       <dbl>
##  1       2001        64750       57627
##  2       2002        61190       55718
##  3       2003        57313       51610
##  4       2004        56421       50958
##  5       2005        55972       50003
##  6       2006        52567       46903
##  7       2007        50427       44825
##  8       2008        46903       41919
##  9       2009        44305       38973
## 10       2010        43156       37591
## 11       2011        44136       38709
## 12       2012        43824       37745
## 13       2013        39762       34019
## 14       2014        38342       33279
## 15       2015        40352       35678

Making highchart plots for private schools

There is highchart plot with two graphs for private schools during 2001 - 2015. The first graph is the rate of the enrollment and the second is the rate of children who have completed immunizations. There is the decrease in both graphs.

Also, these two graphs visually look similar to each other, which confirms the strong positive correlation between enrollment and completed immunizations.

cols <- c("Red","black") # set colors for enrollment and immunization graphs for private schools
highchart() %>%
  hc_yAxis_multiples(
    list(title = list(text = 'Enrollment')),
    list(title = list(text = "Immunization"),
         opposite = TRUE)
  ) %>%
  hc_add_series(data = kinder_priv$total_enroll,
                name = "Enrollment",
                type = "column",
                yAxis = 0) %>%
  hc_add_series(data = kinder_priv$total_immun,
                name = "Immunization",
                type = "line",
                yAxis = 0) %>%
  hc_xAxis(categories = kinder_priv$start_year,
           tickInterval = 2) %>%
  hc_colors(cols) %>%
  hc_chart(style = list(fontFamily = "Georgia",
                        fontWeight = "bold")) %>%
    hc_title(
    text = "Enrollments and Completed Immunizations in Private Schools by Years",
    align = "left"
    )

Creating a new dataframe for public schools

Create a new dataset for public schools to visualize the rates of enrollments and number of children who have completed immunizations during 2001 - 2015.

kinder_pub <- kinder1 %>%
  filter(pub_priv %in% c("Public")) %>% #set dataset with public schools
    select(enrollment,complete, start_year) %>%
     group_by(start_year) %>%
    summarize(total_enroll = sum(enrollment), total_immun = sum(complete)) 
 kinder_pub
## # A tibble: 15 x 3
##    start_year total_enroll total_immun
##  *      <dbl>        <dbl>       <dbl>
##  1       2001       453104      413229
##  2       2002       452370      418727
##  3       2003       450367      418566
##  4       2004       448029      418034
##  5       2005       451252      421213
##  6       2006       445250      414976
##  7       2007       443199      410420
##  8       2008       449182      413280
##  9       2009       447348      409507
## 10       2010       461932      420818
## 11       2011       480200      438647
## 12       2012       481712      437126
## 13       2013       490768      444682
## 14       2014       493598      447977
## 15       2015       507168      473015

Making highchart plots for public schools

There is highchart plot with two graphs for public schools during 2001 - 2015. The first graph is the rate of the enrollment and the second is the rate of children who have completed immunizations. There is the increase in both graphs.

Also, these two graphs visually look similar to each other, which confirms the strong positive correlation between enrollment and completed immunizations.

cols1 <- c("#4682B4","black") # set colors for enrollment and immunization graphs for public schools
highchart() %>%
  hc_yAxis_multiples(
    list(title = list(text = 'Enrollment')),
    list(title = list(text = 'Immunization'),
         opposite = TRUE)
  ) %>%
  hc_add_series(data = kinder_pub$total_enroll,
                name = 'Enrollment',
                type = 'column',
                yAxis = 0) %>%
  hc_add_series(data = kinder_pub$total_immun,
                name = 'Immunization',
                type = 'line',
                yAxis = 0) %>%
  hc_xAxis(categories = kinder_pub$start_year,
           tickInterval = 2) %>%
hc_colors(cols1) %>%
  hc_chart(style = list(fontFamily = "Georgia",
                        fontWeight = "bold")) %>%
    hc_title(
    text = "Enrollments and Completed Immunizations in Public Schools by Years",
    align = "left"
    )

Essay

The source of my dataset is the California Department of Public Health: https://www.cdph.ca.gov/Pages/PageNotFoundError.aspx?requestUrl=https://www.cdph.ca.gov/programs/immunize/Pages/ImmunizationLevels.aspx

The dataset documents enrollment and the number of children with completed immunizations at entry into kindergartens in California from 2001 to 2015. There were eight variables in the original dataset before I started to work with data. There are four numerical discrete variables: unique identifying code for each school, number of children enrolled, number of children with complete immunizations, year of entry. There are four character/categorical variables: whether the school is public or private, county, school district, school name. The dataset contains 110,382 rows in each variable. I cleaned up the dataset by removing all NA’s.

I have chosen this dataset because the topic of immunization is a popular theme in public discussion, especially during the COVID-19 pandemic. Also, I have a personal connection to the topic of children immunization because I have kids and one of them attended an elementary school in California in 2015. Personally, I always supported the requirement that mandates children’s immunizations in order to attend school. Having a strong personal opinion, I wanted to explore this topic in terms of data and trends.

The anti-vaccination movement has a long history, beginning in France in 1763 and continuing through to today. In the US, the roots of the anti-vaccination movement began in the 18th century with religious leaders describing them as the “devil’s work.” The campaign grew in the 19th and 20th centuries as a matter of human rights.

In 2018, about 86% of the world’s children received vaccines that would protect them against polio, diphtheria, tetanus, pertussis, and measles. Immunizations currently prevent 2-3 million deaths every year. Despite this success, more than 1.5 million people die from vaccine-preventable diseases each year.

During my research, I found out many articles stating that the immunization policy has been examined in California around 2013-2015. In 2016, California passed a new law eliminating the personal-belief exemption for vaccines. Currently, California law requires all children enrolled in state schools, both public and private, to have doctor-recommended immunizations. The new law policy was a reaction to a pertussis epidemic in California in 2014. The authors of an article ‘Pertussis epidemic - California, 2014’ states: “In 2014, the California Department of Public Health (CDPH) declared that a pertussis epidemic was occurring in the state when reported incidence was more than five times greater than baseline levels. The incidence of pertussis in the United States is cyclical, with peaks every 3-5 years, as the number of susceptible persons in the population increases. The last pertussis epidemic in California occurred in 2010, when approximately 9,000 cases were reported, including 808 hospitalizations and 10 infant deaths, for a statewide incidence of 24.6 cases per 100,000 population.”

Since the dataset contains the variables such as “Private” and “Public” schools, I decided to research the difference between these types of schools in terms of completed immunizations. Even though both types of schools are under the same law policy, they differ in many other aspects but the most important one is the source of funds.

The boxplot graph shows that the percentage of completed immunizations by types of schools. The medians for both types of schools are almost the same, around 95%. However, the Interquartile Range and lower fence for private schools are wider, which may indicate a more tolerant policy in this type of schools for accepting children without completed immunizations.

The correlation plot shows a strong positive correlation between enrollment and completed immunizations in both private and public schools. It may indicate that during 2001 - 2015 there were some immunization policies at kindergartens before the new law in 2016. Perhaps it was necessary to revise the previous policies towards taking under control occurrence new outbreaks of children diseases.

The heatmap plot visualizes the heterogeneity of immunization data in private schools vs public schools. The missing data has more presence in private schools which means that private schools may not report data appropriately.

The highchart plots visualize two major trends during 2001 - 2015: 1) the increase in the number of enrollments and the number of immunizations in public schools; 2) the decrease in the number of enrollments and the number of immunizations in private schools.

The follow-ups of this analysis may be researching similar immunization data about differences in public and private schools in other states.


References:

https://pubmed.ncbi.nlm.nih.gov/25474033/

https://www.choc.org/articles/immunization-guide/

https://www.chop.edu/centers-programs/vaccine-education-center/global-immunization/diseases-and-vaccines-world-view

https://www.medicalnewstoday.com/articles/anti-vaxxer#what-do-anti-vaxxers-believe

https://measlesrubellainitiative.org/anti-vaccination-movement/