library(readr)
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.2.1 v purrr 0.3.2
## v tibble 2.1.3 v stringr 1.4.0
## v ggplot2 3.2.1 v forcats 0.4.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(ggplot2)
library(RColorBrewer)
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
setwd("C:/Users/Goombakiller55/Documents/College/DATA 110")
df<-read.csv("kindergarten_CA.csv")
df<-na.omit(df)
names(df) <- tolower(names(df))
names(df)<-gsub(" ","",names(df))
str(df)
## 'data.frame': 108730 obs. of 8 variables:
## $ district : Factor w/ 1001 levels "Abc Unified",..: 8 8 8 8 8 8 8 8 8 8 ...
## $ sch_code : int 6967434 6110779 6100374 6090013 6090039 6090047 6090062 6090005 6090088 6090021 ...
## $ county : Factor w/ 58 levels "Alameda","Alpine",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ pub_priv : Factor w/ 2 levels "Private","Public": 1 2 2 2 2 2 2 2 2 2 ...
## $ school : Factor w/ 23140 levels "(ERNESTO) GALARZA ELE",..: 284 1386 5514 5675 6926 8010 11371 11625 12677 14686 ...
## $ enrollment: int 12 78 77 56 41 75 40 80 61 49 ...
## $ complete : int 11 77 73 53 41 65 34 76 61 43 ...
## $ start_year: int 2001 2001 2001 2001 2001 2001 2001 2001 2001 2001 ...
## - attr(*, "na.action")= 'omit' Named int 73434 73460 73527 74597 74688 74772 74907 75167 75579 75580 ...
## ..- attr(*, "names")= chr "73434" "73460" "73527" "74597" ...
df <- mutate(df, percent = (complete/enrollment)*100)
summary(df$percent)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 88.06 95.00 90.33 98.46 100.00
Looking at the summary, most schools have the large majority of their kindergarteners fully immunized: the median is 95.00, the mean is 90.33, and even the 1st quartile is 88.06, despite a minimum of 0.
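One caveat when reading this summary is that each row is a single school record, so the mean of percent counts a 12-student school the same as an 80-student one. Below is a minimal sketch of how the unweighted mean can differ from the enrollment-weighted rate. The data frame toy and its values are made up for illustration and only borrow the column names of df.

```r
# Hypothetical toy data with the same column names as df (values are invented)
toy <- data.frame(
  enrollment = c(10, 100, 50),
  complete   = c(5, 95, 50)
)

# Per-school percentage, computed the same way as df$percent
toy$percent <- toy$complete / toy$enrollment * 100

# Unweighted mean: every school counts equally, whatever its size
unweighted <- mean(toy$percent)

# Weighted rate: all immunized students divided by all enrolled students
weighted <- sum(toy$complete) / sum(toy$enrollment) * 100

unweighted
weighted
```

In this toy example the unweighted mean is about 81.7 while the weighted rate is 93.75, because the small school with a 50% rate pulls the unweighted mean down.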
ggplot(df, mapping=aes(x = county, y= percent)) +
xlab("County") +
ylab("Percent of kindergarteners that are completely Immunized")+
geom_boxplot(color="blue") +
ggtitle("Percentages of completely Immunized Students per County")+
coord_flip()
With 58 counties on one axis, this box plot is too crowded to read or interpret.
ggplot(df) +
xlab("Percent of kindergarteners that are completely Immunized")+
ggtitle("Percentages of completely Immunized Students") +
geom_histogram(mapping=aes(percent),color="red",fill="blue")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Looking at the percentages of kindergarteners who are completely immunized, it is clear that most schools have almost all of their kindergarteners completely immunized, which matches the summary from earlier.
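The `stat_bin()` message above is a prompt to choose a bin width rather than rely on the default of 30 bins. Here is a minimal sketch, assuming 5-point bins are a reasonable choice for a 0 to 100 scale. The data frame toy and its values are invented for illustration.

```r
library(ggplot2)

# Hypothetical toy percentages (invented values on the same 0-100 scale)
toy <- data.frame(percent = c(0, 50, 88, 90, 95, 95, 98, 100, 100, 100))

# binwidth = 5 gives 5-point bins; boundary = 0 aligns bin edges to 0, 5, 10, ...
p <- ggplot(toy, aes(x = percent)) +
  geom_histogram(binwidth = 5, boundary = 0, color = "red", fill = "blue") +
  xlab("Percent of kindergarteners that are completely immunized")

p
```

Setting binwidth explicitly also silences the message, since ggplot2 no longer has to guess.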
students<-df %>%
select(county,percent)
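bot5<- students %>%
group_by(county) %>%
arrange(percent) %>%
# Because the data is grouped, top_n() keeps the 5 highest values of percent
# within each county (ties included), not the 5 lowest counties
top_n(n=5)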
## Selecting by percent
bot5
## # A tibble: 20,185 x 2
## # Groups: county [58]
## county percent
## <fct> <dbl>
## 1 Alpine 93.8
## 2 Mariposa 96.6
## 3 Mariposa 96.8
## 4 Sierra 97.1
## 5 Mariposa 97.4
## 6 Trinity 97.4
## 7 Trinity 97.4
## 8 Modoc 98.2
## 9 Modoc 98.3
## 10 Plumas 98.6
## # ... with 20,175 more rows
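This step narrows the data further so that we can look for the 5 counties with the lowest immunization percentages. Because top_n() runs on data grouped by county, the result still has 20,185 rows and keeps each county's highest values, so the counties at the top of the table are the ones whose best schools have the lowest percentages.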
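An alternative way to rank counties is to collapse each county to a single rate first and then sort. The sketch below is one possible approach, not the method used in this report. The data frame toy is invented, and it keeps 2 rows only because the toy data has 4 counties; the real data would use 5.

```r
library(dplyr)

# Hypothetical toy data with the same column names as df (values are invented)
toy <- data.frame(
  county     = c("A", "A", "B", "B", "C", "C", "D"),
  enrollment = c(50, 50, 40, 60, 20, 80, 100),
  complete   = c(45, 50, 30, 54, 20, 76, 99)
)

lowest <- toy %>%
  group_by(county) %>%
  # One row per county: immunized students over enrolled students
  summarize(percent = sum(complete) / sum(enrollment) * 100) %>%
  # Sort from lowest to highest and keep the first rows
  arrange(percent) %>%
  head(2)

lowest
```

Summarizing before ranking means every county appears exactly once, so the filter for the lowest counties cannot return thousands of rows.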
library(scales)
##
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
##
## discard
## The following object is masked from 'package:readr':
##
## col_factor
plot <-bot5 %>%
filter(county == "Alpine"| county == "Mariposa" | county == "Sierra" | county == "Trinity" | county == "Modoc") %>%
ggplot() +
scale_y_continuous(limits=c(90,100),oob = rescale_none) +
geom_bar(aes(x=county, y=percent), position = "dodge", stat = "identity")+
labs( xlab= "Counties" , ylab = "Percent", title = "Counties with lowest Immunization Percentage")
plot
The bars all read 100% for immunized kindergarteners, which does not match the table above, so something is clearly wrong with the code here.
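One likely cause is that each county contributes several rows to the chart, so the bars for a county are drawn on top of one another and only the tallest one is visible. Below is a minimal sketch of a bar chart that draws exactly one bar per county by summarizing first. The data frame toy, the name county_rates, and the 80 to 100 zoom range are assumptions made for illustration.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data with the same column names as df (values are invented)
toy <- data.frame(
  county     = c("A", "A", "B", "B"),
  enrollment = c(50, 50, 40, 60),
  complete   = c(45, 50, 30, 54)
)

# Collapse to one row per county so each bar has a single height
county_rates <- toy %>%
  group_by(county) %>%
  summarize(percent = sum(complete) / sum(enrollment) * 100)

p <- ggplot(county_rates, aes(x = county, y = percent)) +
  geom_col() +
  # Zoom the y axis without dropping any bars
  coord_cartesian(ylim = c(80, 100)) +
  # labs() names axis titles with x and y
  labs(x = "County", y = "Percent completely immunized",
       title = "Counties with lowest immunization percentage")

p
```

With one row per county, the bar heights are ordinary percentages and cannot stack above 100.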
plot2<- bot5 %>%
filter(county == "Alpine"| county == "Mariposa" | county == "Sierra" | county == "Trinity" | county == "Modoc") %>%
ggplot(mapping=aes(x = county, y= percent)) +
xlab("County") +
ylab("Percent of kindergarteners that are completely Immunized")+
geom_boxplot(color="blue") +
ggtitle("Percentages of completely Immunized Students per County")+
coord_flip()
plot2
This box plot is much easier to read than the earlier one because it shows only 5 counties instead of all 58. Although their values are all in the 90s, these are the counties that ranked lowest in the table above.
Unfortunately, my final visualization added the percentages together for the selected counties across every year. I then filtered the data to show only 2015, but the totals were still over 100, which does not make sense for a percentage. I also could not rename the x axis, most likely because labs() expects the argument names x and y rather than xlab and ylab.
This data set focuses on the number of kindergarteners in different districts and counties in California who are completely vaccinated and the number who are not. The data set was taken from the Google Drive website. The variables I focused on are the total number of students enrolled, the number of students completely immunized, and the start year of enrollment. I chose this data set because vaccination is a hot topic at the moment, with some people arguing against vaccinating children, and it seemed like an interesting data set to dive into.
One reason people are against vaccinating their children is religious or philosophical belief, which is very understandable (Calandrillo). Another key reason is misinformation about risk and overperception of risk (Calandrillo). Parents look online and read articles that display all the disadvantages and risks of vaccines, often rare cases, but do not show the benefits of the vaccines.
The data was cleaned by filtering it down to only the 5 counties with the lowest immunization percentages. Looking at the overall picture, the number of kindergarteners who are completely vaccinated increases from 2001 to 2015. Some counties have 100% of their kindergarteners completely vaccinated, while the others are very close to it. I imagine that if the data went up to the current year, some of those 100% figures might have dropped into the upper 90s.
Works Cited: Calandrillo, Steve P. “Vanishing Vaccinations: Why Are so Many Americans Opting out of Vaccinating Their Children?” University of Michigan Journal of Law Reform. University of Michigan. Law School, U.S. National Library of Medicine, 2004, https://www.ncbi.nlm.nih.gov/pubmed/15568260.