library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.0.5
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3 v purrr 0.3.4
## v tibble 3.0.6 v dplyr 1.0.3
## v tidyr 1.1.2 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(tinytex)
library(dplyr)
library(ggplot2)
library(plotly)
## Warning: package 'plotly' was built under R version 4.0.4
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(highcharter)
## Warning: package 'highcharter' was built under R version 4.0.5
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
library(RColorBrewer)
kinder <- read_csv("kindergarten_CA.csv")
##
## -- Column specification --------------------------------------------------------
## cols(
## district = col_character(),
## sch_code = col_double(),
## county = col_character(),
## pub_priv = col_character(),
## school = col_character(),
## enrollment = col_double(),
## complete = col_double(),
## start_year = col_double()
## )
#view(kinder)
As we can see, there are 8 variables and 110,382 rows in each of the variables. The names of the variables are applicable for analysis. There are 4 numerical discrete and 4 character/categorical variables in the dataset.
str(kinder)
## spec_tbl_df [110,382 x 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ district : chr [1:110382] "Alameda Unified" "Alameda Unified" "Alameda Unified" "Alameda Unified" ...
## $ sch_code : num [1:110382] 6967434 6110779 6100374 6090013 6090039 ...
## $ county : chr [1:110382] "Alameda" "Alameda" "Alameda" "Alameda" ...
## $ pub_priv : chr [1:110382] "Private" "Public" "Public" "Public" ...
## $ school : chr [1:110382] "ALAMEDA CHRTN" "BAY FARM ELEM" "EARHART (AMELIA) ELEM" "EDISON ELEM" ...
## $ enrollment: num [1:110382] 12 78 77 56 41 75 40 80 61 49 ...
## $ complete : num [1:110382] 11 77 73 53 41 65 34 76 61 43 ...
## $ start_year: num [1:110382] 2001 2001 2001 2001 2001 ...
## - attr(*, "spec")=
## .. cols(
## .. district = col_character(),
## .. sch_code = col_double(),
## .. county = col_character(),
## .. pub_priv = col_character(),
## .. school = col_character(),
## .. enrollment = col_double(),
## .. complete = col_double(),
## .. start_year = col_double()
## .. )
kinder1 <- na.omit(kinder)
The reason for checking the summary is getting the main statistical characteristics of numerical variables.
summary(kinder1)
## district sch_code county pub_priv
## Length:108730 Min. : 1501 Length:108730 Length:108730
## Class :character 1st Qu.:6019863 Class :character Class :character
## Mode :character Median :6048201 Mode :character Mode :character
## Mean :5884901
## 3rd Qu.:6120430
## Max. :9999999
## school enrollment complete start_year
## Length:108730 Min. : 10.00 Min. : 0.00 Min. :2001
## Class :character 1st Qu.: 34.00 1st Qu.: 29.00 1st Qu.:2004
## Mode :character Median : 68.00 Median : 61.00 Median :2008
## Mean : 70.77 Mean : 64.89 Mean :2008
## 3rd Qu.: 98.00 3rd Qu.: 91.00 3rd Qu.:2012
## Max. :981.00 Max. :973.00 Max. :2015
Since there are two types of schools in the dataset (private and public), it might be interesting to explore the rate of immunization in these types of schools.
I added a new variable - a ratio of children that have been vaccinated to the total number of enrolled children.
kinder1 <- mutate(kinder1, percentage = complete / enrollment * 100)
#view(kinder1)
This boxplot shows the percentage of completed immunizations by types of schools. The medians for both types of schools are almost the same, around 95%.
However, the lower fence (58%) and Q1 (83%) are lower in private schools that in public schools (76% and 89% respectively).
Also, the Interquartile Range (IQR = 17%) is wider for private schools than for public schools (9%).
P1 <- plot_ly(kinder1, x = ~pub_priv, y = ~percentage, color = ~pub_priv, colors = "Set1", type = "box") %>%
layout(title = "Percentage of Children Immunizations by School Type", xaxis = list(title = "Type of School"), yaxis = list(title = "Percentage"))
P1
The plot shows the relationship of two variables - the number of enrollment in schools and the number of children who have completed immunizations. The visualization is divided into two graphs by type of school.
On the graphs we can observe strong positive correlation between enrollment and completed immunizations in both private and public schools.
p2 <- ggplot(kinder1, aes(enrollment, complete, colour = factor(pub_priv))) +
geom_point (alpha = 0.3, size = 1.3)+
scale_colour_manual(values=c('#FF0000', '#4682B4'), # set colors for plot lines
name="Type of school",
breaks=c("Private", "Public"),
labels=c("Private", "Public"))+ # set legend
facet_wrap(.~ pub_priv) +
geom_smooth(method='lm',formula=y~x, size = 0.3, color = "black")+ # add linear regression
ggtitle("Enrollment vs. Complete Immunizations", subtitle = 'By School Type')+
theme_bw()
p2
#ggplotly(p2) # may be applied for more interactivity on the graphs
The correlation coefficient is 0.98, which means strong positive correlation.
The model has the equation: complete = 0.9320(enrollment) - 1.0673
The Adjusted R-Squared value (0.9581) states that about 96% of the variation in the observations may be explained by this model.
p-value (< 2.2e-16) is significantly small, so the predictor (enrollment) is useful to the model.
cor(kinder1$enrollment, kinder1$complete) # correlation coefficient
## [1] 0.9788065
fit <- lm(complete ~ enrollment, data = kinder1) # statistacal summary
summary(fit)
##
## Call:
## lm(formula = complete ~ enrollment, data = kinder1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -227.382 -1.486 1.933 4.166 59.753
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.0672596 0.0492995 -21.65 <2e-16 ***
## enrollment 0.9320223 0.0005914 1576.03 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.591 on 108728 degrees of freedom
## Multiple R-squared: 0.9581, Adjusted R-squared: 0.9581
## F-statistic: 2.484e+06 on 1 and 108728 DF, p-value: < 2.2e-16
The plot shows the percentage of complete immunizations by county. The visualization is divided into two graphs by type of school.
The heatmap plot shows that for private schools there are more cases with a low rate of immunizations. Also, for private schools there is more missing data in the report.
kinder1 %>%
mutate(county = reorder(county, percentage)) %>% # reorder counties by the means of percentage
ggplot(aes(start_year, county, fill = percentage)) +
geom_tile(color = "white") +
scale_x_continuous(breaks = c(2001,2003,2005,2007,2009,2011,2013,2015)) +
scale_fill_gradientn(colors = brewer.pal(5, "RdYlGn")) +
theme_classic() +
theme(panel.grid = element_blank()) +
ggtitle('Percentage of Children with Immunizations', subtitle = 'By School Type') +
ylab("Counties") +
xlab("Years") +
facet_grid(.~ pub_priv)
The new dataset ‘kinder_priv’ contains data only for private schools. This data is needed to visualize two general trends by years in private schools.
kinder_priv <- kinder1 %>%
filter(pub_priv %in% c("Private")) %>% # set dataset with privet schools
select(enrollment, complete, start_year) %>%
group_by(start_year) %>%
summarize(total_enroll = sum(enrollment), total_immun = sum(complete))
kinder_priv
## # A tibble: 15 x 3
## start_year total_enroll total_immun
## * <dbl> <dbl> <dbl>
## 1 2001 64750 57627
## 2 2002 61190 55718
## 3 2003 57313 51610
## 4 2004 56421 50958
## 5 2005 55972 50003
## 6 2006 52567 46903
## 7 2007 50427 44825
## 8 2008 46903 41919
## 9 2009 44305 38973
## 10 2010 43156 37591
## 11 2011 44136 38709
## 12 2012 43824 37745
## 13 2013 39762 34019
## 14 2014 38342 33279
## 15 2015 40352 35678
There is highchart plot with two graphs for private schools during 2001 - 2015. The first graph is the rate of the enrollment and the second is the rate of children who have completed immunizations. There is the decrease in both graphs.
Also, these two graphs visually look similar to each other, which confirms the strong positive correlation between enrollment and completed immunizations.
cols <- c("Red","black") # set colors for enrollment and immunization graphs for private schools
highchart() %>%
hc_yAxis_multiples(
list(title = list(text = 'Enrollment')),
list(title = list(text = "Immunization"),
opposite = TRUE)
) %>%
hc_add_series(data = kinder_priv$total_enroll,
name = "Enrollment",
type = "column",
yAxis = 0) %>%
hc_add_series(data = kinder_priv$total_immun,
name = "Immunization",
type = "line",
yAxis = 0) %>%
hc_xAxis(categories = kinder_priv$start_year,
tickInterval = 2) %>%
hc_colors(cols) %>%
hc_chart(style = list(fontFamily = "Georgia",
fontWeight = "bold")) %>%
hc_title(
text = "Enrollments and Completed Immunizations in Private Schools by Years",
align = "left"
)
Create a new dataset for public schools to visualize the rates of enrollments and number of children who have completed immunizations during 2001 - 2015.
kinder_pub <- kinder1 %>%
filter(pub_priv %in% c("Public")) %>% #set dataset with public schools
select(enrollment,complete, start_year) %>%
group_by(start_year) %>%
summarize(total_enroll = sum(enrollment), total_immun = sum(complete))
kinder_pub
## # A tibble: 15 x 3
## start_year total_enroll total_immun
## * <dbl> <dbl> <dbl>
## 1 2001 453104 413229
## 2 2002 452370 418727
## 3 2003 450367 418566
## 4 2004 448029 418034
## 5 2005 451252 421213
## 6 2006 445250 414976
## 7 2007 443199 410420
## 8 2008 449182 413280
## 9 2009 447348 409507
## 10 2010 461932 420818
## 11 2011 480200 438647
## 12 2012 481712 437126
## 13 2013 490768 444682
## 14 2014 493598 447977
## 15 2015 507168 473015
There is highchart plot with two graphs for public schools during 2001 - 2015. The first graph is the rate of the enrollment and the second is the rate of children who have completed immunizations. There is the increase in both graphs.
Also, these two graphs visually look similar to each other, which confirms the strong positive correlation between enrollment and completed immunizations.
cols1 <- c("#4682B4","black") # set colors for enrollment and immunization graphs for public schools
highchart() %>%
hc_yAxis_multiples(
list(title = list(text = 'Enrollment')),
list(title = list(text = 'Immunization'),
opposite = TRUE)
) %>%
hc_add_series(data = kinder_pub$total_enroll,
name = 'Enrollment',
type = 'column',
yAxis = 0) %>%
hc_add_series(data = kinder_pub$total_immun,
name = 'Immunization',
type = 'line',
yAxis = 0) %>%
hc_xAxis(categories = kinder_pub$start_year,
tickInterval = 2) %>%
hc_colors(cols1) %>%
hc_chart(style = list(fontFamily = "Georgia",
fontWeight = "bold")) %>%
hc_title(
text = "Enrollments and Completed Immunizations in Public Schools by Years",
align = "left"
)
The source of my dataset is the California Department of Public Health: https://www.cdph.ca.gov/Pages/PageNotFoundError.aspx?requestUrl=https://www.cdph.ca.gov/programs/immunize/Pages/ImmunizationLevels.aspx
The dataset documents enrollment and the number of children with completed immunizations at entry into kindergartens in California from 2001 to 2015. There were eight variables in the original dataset before I started to work with data. There are four numerical discrete variables: unique identifying code for each school, number of children enrolled, number of children with complete immunizations, year of entry. There are four character/categorical variables: whether the school is public or private, county, school district, school name. The dataset contains 110,382 rows in each variable. I cleaned up the dataset by removing all NA’s.
I have chosen this dataset because the topic of immunization is a popular theme in public discussion, especially during the COVID-19 pandemic. Also, I have a personal connection to the topic of children immunization because I have kids and one of them attended an elementary school in California in 2015. Personally, I always supported the requirement that mandates children’s immunizations in order to attend school. Having a strong personal opinion, I wanted to explore this topic in terms of data and trends.
The anti-vaccination movement has a long history, beginning in France in 1763 and continuing through to today. In the US, the roots of the anti-vaccination movement began in the 18th century with religious leaders describing them as the “devil’s work.” The campaign grew in the 19th and 20th centuries as a matter of human rights.
In 2018, about 86% of the world’s children received vaccines that would protect them against polio, diphtheria, tetanus, pertussis, and measles. Immunizations currently prevent 2-3 million deaths every year. Despite this success, more than 1.5 million people die from vaccine-preventable diseases each year.
During my research, I found out many articles stating that the immunization policy has been examined in California around 2013-2015. In 2016, California passed a new law eliminating the personal-belief exemption for vaccines. Currently, California law requires all children enrolled in state schools, both public and private, to have doctor-recommended immunizations. The new law policy was a reaction to a pertussis epidemic in California in 2014. The authors of an article ‘Pertussis epidemic - California, 2014’ states: “In 2014, the California Department of Public Health (CDPH) declared that a pertussis epidemic was occurring in the state when reported incidence was more than five times greater than baseline levels. The incidence of pertussis in the United States is cyclical, with peaks every 3-5 years, as the number of susceptible persons in the population increases. The last pertussis epidemic in California occurred in 2010, when approximately 9,000 cases were reported, including 808 hospitalizations and 10 infant deaths, for a statewide incidence of 24.6 cases per 100,000 population.”
Since the dataset contains the variables such as “Private” and “Public” schools, I decided to research the difference between these types of schools in terms of completed immunizations. Even though both types of schools are under the same law policy, they differ in many other aspects but the most important one is the source of funds.
The boxplot graph shows that the percentage of completed immunizations by types of schools. The medians for both types of schools are almost the same, around 95%. However, the Interquartile Range and lower fence for private schools are wider, which may indicate a more tolerant policy in this type of schools for accepting children without completed immunizations.
The correlation plot shows a strong positive correlation between enrollment and completed immunizations in both private and public schools. It may indicate that during 2001 - 2015 there were some immunization policies at kindergartens before the new law in 2016. Perhaps it was necessary to revise the previous policies towards taking under control occurrence new outbreaks of children diseases.
The heatmap plot visualizes the heterogeneity of immunization data in private schools vs public schools. The missing data has more presence in private schools which means that private schools may not report data appropriately.
The highchart plots visualize two major trends during 2001 - 2015: 1) the increase in the number of enrollments and the number of immunizations in public schools; 2) the decrease in the number of enrollments and the number of immunizations in private schools.
The follow-ups of this analysis may be researching similar immunization data about differences in public and private schools in other states.
References:
https://pubmed.ncbi.nlm.nih.gov/25474033/
https://www.choc.org/articles/immunization-guide/
https://www.medicalnewstoday.com/articles/anti-vaxxer#what-do-anti-vaxxers-believe
https://measlesrubellainitiative.org/anti-vaccination-movement/