---
title: "IS360Assignment7"
author: "Sam CD"
date: "October 21, 2014"
output: html_document
---
For this project, I set out to examine class size in New York City public schools. There is much debate about whether the early years are the most crucial part of a child’s education, and class size is often mentioned as a factor. While I could not prove whether larger or smaller classes are better, I chose to find outliers in class size data using the City of New York Data Catalog at data.gov.
I loaded the data set 2007-08_Class_Size_-School-levelDetail.csv into a table called schools and then filtered it to include only kindergarten classes:
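library(dplyr)
# Keep rows whose GRADE contains "K", then drop any GRADE value that contains a hyphen
k <- filter(schools, grepl("K", GRADE))
k <- filter(k, !grepl("-", GRADE))
# Drop the block of columns from CORE.SUBJECT.9.12.ONLY. through SERVICE.CATEGORY.K.9.ONLY.
k <- select(k, -(CORE.SUBJECT.9.12.ONLY.:SERVICE.CATEGORY.K.9.ONLY.))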
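To make the effect of the two `grepl` filters concrete, here is a small sketch on a toy table. The `GRADE` values and the `SCHOOL` column below are invented for illustration and may not match the labels used in the real file.

```r
library(dplyr)

# Toy stand-in for the real `schools` table; all values here are hypothetical
toy <- data.frame(
  SCHOOL = c("A", "B", "C", "D"),
  GRADE  = c("0K", "01", "0K-09", "0K"),
  stringsAsFactors = FALSE
)

# Keep grades containing "K", then drop any that also contain a hyphen
toy_k <- toy %>%
  filter(grepl("K", GRADE)) %>%
  filter(!grepl("-", GRADE))

print(toy_k)
```

Only the rows whose grade label contains "K" and no hyphen survive, which is the same logic applied to `schools` above.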
The following code creates variables by which we can find a mean class size for kindergarten classes throughout New York City and then find out which classes are farthest from that mean. It also shows which classes have the greatest discrepancy between smallest and largest class size, and then creates a histogram of each of these two variables.
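A minimal sketch of that computation, under stated assumptions: the column names `AVERAGE.CLASS.SIZE`, `SIZE.OF.SMALLEST.CLASS` and `SIZE.OF.LARGEST.CLASS` are assumed rather than taken from the text, the derived names `DIFF.FROM.MEAN` and `SIZE.RANGE` are hypothetical, and the toy rows stand in for the filtered table `k`. The mean here is an unweighted mean of per-row averages, which may differ from how the original analysis computed it.

```r
library(dplyr)

# Toy stand-in for the filtered kindergarten table `k`; names and values are assumed
k_toy <- data.frame(
  SCHOOL.NAME            = c("P.S. A", "P.S. B", "P.S. C", "P.S. D"),
  AVERAGE.CLASS.SIZE     = c(20, 22, 30, 16),
  SIZE.OF.SMALLEST.CLASS = c(19, 21, 24, 15),
  SIZE.OF.LARGEST.CLASS  = c(21, 23, 33, 17),
  stringsAsFactors = FALSE
)

# Unweighted citywide mean of the per-row average class size
city_mean <- mean(k_toy$AVERAGE.CLASS.SIZE, na.rm = TRUE)

k_toy <- k_toy %>%
  mutate(
    # Signed distance of each row's average class size from the citywide mean
    DIFF.FROM.MEAN = AVERAGE.CLASS.SIZE - city_mean,
    # Gap between the largest and smallest class in the row
    SIZE.RANGE = SIZE.OF.LARGEST.CLASS - SIZE.OF.SMALLEST.CLASS
  )

# One histogram per derived variable
hist(k_toy$DIFF.FROM.MEAN, main = "Distance from mean class size", xlab = "Students")
hist(k_toy$SIZE.RANGE, main = "Largest minus smallest class", xlab = "Students")
```

Keeping the distance signed means values cluster around 0 when most classes are close to the mean, while `abs()` can be applied later when only the magnitude matters.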
Looking at the histograms, we can see that the data appears roughly normally distributed, and that the majority of the classes sit around 0. However, there are a few classes with a difference of greater than 5 between their smallest and largest class and an average more than 5 away from the mean. So our final table, narrowed down from over 15,000 observations to just 200, gives us the schools that we can focus on in order to distribute the students more evenly.
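The narrowing step can be sketched as a single `filter` call. This is an illustration under assumptions, not the original code: `DIFF.FROM.MEAN` and `SIZE.RANGE` are hypothetical column names, the rows are toy values, and the two conditions are combined with "and" as the text describes.

```r
library(dplyr)

# Toy values; DIFF.FROM.MEAN and SIZE.RANGE are hypothetical column names
k_toy <- data.frame(
  SCHOOL.NAME    = c("P.S. A", "P.S. B", "P.S. C", "P.S. D"),
  DIFF.FROM.MEAN = c(-2, 0, 8, -6),
  SIZE.RANGE     = c(2, 2, 9, 2),
  stringsAsFactors = FALSE
)

# Cutoff of 5 students, taken from the text
threshold <- 5

# Keep rows that exceed the cutoff on both measures, farthest from the mean first
outliers <- k_toy %>%
  filter(abs(DIFF.FROM.MEAN) > threshold, SIZE.RANGE > threshold) %>%
  arrange(desc(abs(DIFF.FROM.MEAN)))

print(outliers)
```

Replacing the comma inside `filter` with `|` would keep rows that exceed the cutoff on either measure, which gives a larger table.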
A possible next step would be to analyze the proximity of these schools and match each one with nearby schools that have the opposite need (i.e., class sizes too large vs. class sizes too small).