1.) Read in Dataset
library(readr)
K5_df <- read_csv("kindergarten_CA.csv")
## Rows: 110382 Columns: 8
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (4): district, county, pub_priv, school
## dbl (4): sch_code, enrollment, complete, start_year
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
View(K5_df)
For this project, I used the California Kindergarten dataset created by Professor Rachel Saidi. I retrieved it from the course resources tab through a shared OneDrive link. The dataset contains eight variables: district, school code, county, public or private, school (by name), start year, enrollment, and completion. District, county, public or private, and school (by name) are categorical variables. School code is stored as a number, but it works as an identifier rather than a measurement. Enrollment, completion, and start year are quantitative variables; enrollment and completion are counts of students.
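As a quick check on the variable types described above, the sketch below inspects column classes on a small made-up data frame that mimics the dataset's column layout. The rows are invented for illustration and are not taken from the file. It also shows one way to store the school code as text, on the assumption that it is an identifier rather than a measurement.

```r
# Toy rows that mimic the dataset's column layout; values are illustrative only.
toy_df <- data.frame(
  district   = c("District A", "District B"),
  sch_code   = c(6000001, 6000002),
  county     = c("Los Angeles", "Alameda"),
  pub_priv   = c("Public", "Private"),
  school     = c("SCHOOL ONE", "SCHOOL TWO"),
  enrollment = c(120, 45),
  complete   = c(110, 40),
  start_year = c(2001, 2002),
  stringsAsFactors = FALSE
)

# Class of each column: character columns are categorical, numeric ones are stored as numbers.
col_classes <- sapply(toy_df, class)
print(col_classes)

# Assumption: sch_code is an ID, so arithmetic on it is meaningless; store it as text.
toy_df$sch_code <- as.character(toy_df$sch_code)

# Check that enrollment and completion are whole-number counts.
counts_are_whole <- all(toy_df$enrollment %% 1 == 0, toy_df$complete %% 1 == 0)
```

The same `sapply(K5_df, class)` call could be run on the real data to confirm what `read_csv()` guessed for each column.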
Summary Statistics for Kindergarten Dataset
The data set has only eight variables, so I pulled summary statistics for the entire data set. The two quantitative variables I chose to focus on were enrollment and completion, both recorded as counts.
summary(K5_df)
## district sch_code county pub_priv
## Length:110382 Min. : 1501 Length:110382 Length:110382
## Class :character 1st Qu.:6019905 Class :character Class :character
## Mode :character Median :6048706 Mode :character Mode :character
## Mean :5879880
## 3rd Qu.:6134460
## Max. :9999999
##
## school enrollment complete start_year
## Length:110382 Min. : 10.00 Min. : 0.00 Min. :2001
## Class :character 1st Qu.: 34.00 1st Qu.: 29.00 1st Qu.:2004
## Mode :character Median : 68.00 Median : 61.00 Median :2008
## Mean : 70.77 Mean : 64.89 Mean :2008
## 3rd Qu.: 98.00 3rd Qu.: 91.00 3rd Qu.:2012
## Max. :981.00 Max. :973.00 Max. :2015
## NA's :1652 NA's :1652
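The summary reports raw counts for enrollment and completion, along with some missing values. The following is a minimal sketch, using made-up numbers, of how missing values could be counted and how a completion proportion could be derived from the two counts. The column name `complete_rate` is hypothetical and is not part of the dataset.

```r
# Illustrative values only; NA marks a missing count.
toy_df <- data.frame(
  enrollment = c(100, 50, 80, NA),
  complete   = c(90, 50, NA, 20)
)

# Count missing values in each column.
na_counts <- colSums(is.na(toy_df))

# Completion proportion per row; rows with a missing count yield NA.
toy_df$complete_rate <- toy_df$complete / toy_df$enrollment

# Average proportion over rows where both counts are present.
mean_rate <- mean(toy_df$complete_rate, na.rm = TRUE)
```

Working with a proportion rather than the raw count makes schools of different sizes easier to compare, though it is only one possible way to summarize completion.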
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
LAK5_df <- K5_df %>%
filter(county == "Los Angeles") %>%
filter(enrollment > 150)
LAK5_df
## # A tibble: 2,003 x 8
## district sch_code county pub_priv school enrollment complete start_year
## <chr> <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 Azusa Unified 6011282 Los A~ Public LONGF~ 255 227 2001
## 2 Baldwin Park ~ 6011431 Los A~ Public GEDDE~ 166 146 2001
## 3 Castaic Union 6012033 Los A~ Public CASTA~ 196 188 2001
## 4 Compton Unifi~ 6012355 Los A~ Public AUGUS~ 160 160 2001
## 5 Compton Unifi~ 6012256 Los A~ Public DICKI~ 159 133 2001
## 6 Compton Unifi~ 6012280 Los A~ Public FOSTE~ 178 150 2001
## 7 Compton Unifi~ 6012306 Los A~ Public KELLY~ 226 217 2001
## 8 Compton Unifi~ 6012314 Los A~ Public KENNE~ 154 129 2001
## 9 Compton Unifi~ 6012389 Los A~ Public THEOD~ 263 263 2001
## 10 Downey Unified 6012744 Los A~ Public ALAME~ 191 171 2001
## # ... with 1,993 more rows
4.) Frequency Table
freqtableK5 <- table(LAK5_df$pub_priv)
freqtableK5
##
## Private Public
## 10 1993
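To make the imbalance in the frequency table easier to read, the counts can be converted to percentages. This sketch re-enters the two counts shown above by hand so that it runs on its own; with the real data, `prop.table(freqtableK5)` should give the same proportions.

```r
# Counts copied from the frequency table above.
freq <- c(Private = 10, Public = 1993)

# Convert counts to proportions of the total.
props <- prop.table(freq)

# Express as percentages rounded to one decimal place.
pct <- round(100 * props, 1)
print(pct)
```

Public schools make up roughly 99.5% of the filtered rows.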
5.) Contingency Table
#table(LAK5_df$pub_priv, LAK5_df$school)
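Crossing school type with `school`, as in the commented-out line, would produce one column per school name. One possible alternative, offered as a sketch rather than as part of the original analysis, is to cross school type with `start_year`, treating the year as a grouping variable. The rows below are invented for illustration.

```r
# Illustrative rows only; not taken from the dataset.
toy_df <- data.frame(
  pub_priv   = c("Public", "Public", "Private", "Public", "Private", "Public"),
  start_year = c(2001, 2001, 2001, 2002, 2002, 2002)
)

# Two-way table: rows are school type, columns are start year.
ctab <- table(toy_df$pub_priv, toy_df$start_year)
print(ctab)

# Add row and column totals.
ctab_totals <- addmargins(ctab)
print(ctab_totals)
```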
A few weeks ago I did a tree mapping project with this data set, where I found that most parents in San Bernardino County, California preferred private Montessori schools for their children’s education. For this project, I decided to look at all the schools in Los Angeles County. I quickly learned that Los Angeles County is very populous, so I used dplyr's filter() function to keep only the public and private schools with more than 150 kids enrolled.
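The filtering step described above can be sketched on a small made-up tibble. It compares the chained form used in this project with a single `filter()` call, and shows that rows with a missing enrollment are dropped by the comparison. The data values are invented for illustration.

```r
library(dplyr)

# Illustrative rows only; NA marks a missing enrollment.
toy_df <- tibble(
  county     = c("Los Angeles", "Los Angeles", "Alameda", "Los Angeles"),
  enrollment = c(200, 150, 300, NA)
)

# Chained filters, as in the analysis.
chained <- toy_df %>%
  filter(county == "Los Angeles") %>%
  filter(enrollment > 150)

# Single call: conditions separated by commas are combined with AND.
combined <- toy_df %>%
  filter(county == "Los Angeles", enrollment > 150)

# Rows removed by the filter, including the row with a missing enrollment.
rows_dropped <- nrow(toy_df) - nrow(combined)
```

Note that `enrollment > 150` excludes a school with exactly 150 enrolled; `>=` would be needed to keep it.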
# pie() draws the chart and returns NULL, so there is nothing to assign
pie(table(LAK5_df$pub_priv))
California has 58 counties. After filtering to county == “Los Angeles” and enrollment > 150 in my code, I was still left with over 2,000 rows of data. I went on to create a frequency table of public vs private schools in LA County. I did not create a meaningful contingency table because, after filtering, the dataset did not give me a second categorical variable that was practical to cross-tabulate. The two visualizations I was able to create from this dataset, the pie chart and the bar graph, show that public schools account for almost all of the filtered rows (1,993 of 2,003), which suggests that public school is either preferred or often the first choice in Los Angeles County.
# Bar chart of the number of rows by school type (bars count rows, not students)
LABarK5 <- barplot(table(LAK5_df$pub_priv),
                   main = "Los Angeles County Public and Private Schools",
                   xlab = "School Type",
                   ylab = "Count",
                   col = "blue",
                   density = 10
)
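Because the Private bar is tiny next to the Public bar, printing the counts on the chart may help. `barplot()` returns the x positions of the bar midpoints, which is what the assignment to `LABarK5` captures, and those positions can be passed to `text()`. This sketch re-enters the two counts by hand so it runs on its own; the title and the `ylim` headroom are illustrative choices.

```r
# Counts copied from the frequency table.
freq <- c(Private = 10, Public = 1993)

# barplot() returns the x positions of the bar midpoints.
bar_mid <- barplot(freq,
                   main = "School Type (illustrative)",
                   ylab = "Count",
                   ylim = c(0, max(freq) * 1.1))

# Print each count just above its bar.
text(x = bar_mid, y = freq, labels = freq, pos = 3)
```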