1.) Read in Dataset

library(readr)
K5_df <- read_csv("kindergarten_CA.csv")
## Rows: 110382 Columns: 8
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (4): district, county, pub_priv, school
## dbl (4): sch_code, enrollment, complete, start_year
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
View(K5_df)

For this project, I used the California Kindergarten dataset created by Professor Rachel Saidi. I retrieved it from the course resources tab through a shared OneDrive link. The dataset has eight variables: district, school code, county, public or private, school (by name), start year, enrollment, and completion. District, county, public or private, and school (by name) are all categorical variables. School code, start year, enrollment, and completion are all stored as numbers. Enrollment and completion are counts and start year is a calendar year, so these three are discrete quantitative variables rather than continuous ones. School code is an identifier rather than a measurement, even though R reads it as a number.
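
The variable types described above can be checked and adjusted in R. The following is a minimal sketch on a made-up three-row data frame (`toy_df` and all of its values are hypothetical; only the column names match `K5_df`). It reports how each column is stored, then treats the school code as text and the public/private flag as a factor.

```r
# Toy rows with the same column names as K5_df; every value is made up.
toy_df <- data.frame(
  district   = c("District A", "District B", "District C"),
  sch_code   = c(1000001, 1000002, 1000003),
  county     = c("Los Angeles", "Los Angeles", "Orange"),
  pub_priv   = c("Public", "Private", "Public"),
  school     = c("SCHOOL ONE", "SCHOOL TWO", "SCHOOL THREE"),
  enrollment = c(255, 40, 196),
  complete   = c(227, 38, 188),
  start_year = c(2001, 2001, 2002),
  stringsAsFactors = FALSE
)

# Report how R stores each column before any conversion.
col_classes <- sapply(toy_df, class)
print(col_classes)

# sch_code is an identifier, so store it as text instead of a number.
toy_df$sch_code <- as.character(toy_df$sch_code)

# pub_priv has two categories, so a factor is a natural fit.
toy_df$pub_priv <- factor(toy_df$pub_priv)
str(toy_df)
```

Converting the school code to text keeps `summary()` from reporting a mean and quartiles for it, which have no meaning for an identifier.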

Summary Statistics for Kindergarten Dataset

My data set does not contain many variables, so I pulled the summary statistics for the entire data set. The two quantitative variables that I chose to focus on were enrollment and completion, both of which are recorded as counts.

summary(K5_df)
##    district            sch_code          county            pub_priv        
##  Length:110382      Min.   :   1501   Length:110382      Length:110382     
##  Class :character   1st Qu.:6019905   Class :character   Class :character  
##  Mode  :character   Median :6048706   Mode  :character   Mode  :character  
##                     Mean   :5879880                                        
##                     3rd Qu.:6134460                                        
##                     Max.   :9999999                                        
##                                                                            
##     school            enrollment        complete        start_year  
##  Length:110382      Min.   : 10.00   Min.   :  0.00   Min.   :2001  
##  Class :character   1st Qu.: 34.00   1st Qu.: 29.00   1st Qu.:2004  
##  Mode  :character   Median : 68.00   Median : 61.00   Median :2008  
##                     Mean   : 70.77   Mean   : 64.89   Mean   :2008  
##                     3rd Qu.: 98.00   3rd Qu.: 91.00   3rd Qu.:2012  
##                     Max.   :981.00   Max.   :973.00   Max.   :2015  
##                     NA's   :1652     NA's   :1652
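
To focus on only enrollment and completion, the summary can be restricted to those two columns. This is a small sketch on made-up values (`toy_df` is hypothetical). The `completion_share` column is my own illustration and is not part of the dataset; it assumes that `complete` counts a subset of the enrolled students, which the dataset description here does not confirm.

```r
# Made-up values, including one missing row like the NA's in the real data.
toy_df <- data.frame(
  enrollment = c(255, 40, NA, 196),
  complete   = c(227, 38, NA, 188)
)

# Summary statistics for just the two columns of interest.
print(summary(toy_df[, c("enrollment", "complete")]))

# mean() returns NA when values are missing unless na.rm = TRUE is set.
mean_enrollment <- mean(toy_df$enrollment, na.rm = TRUE)
mean_complete   <- mean(toy_df$complete, na.rm = TRUE)

# Hypothetical derived column: completion as a share of enrollment.
toy_df$completion_share <- toy_df$complete / toy_df$enrollment
print(toy_df)
```
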
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
LAK5_df <- K5_df %>%
  filter(county == "Los Angeles") %>%
  filter(enrollment > 150)

LAK5_df
## # A tibble: 2,003 x 8
##    district       sch_code county pub_priv school enrollment complete start_year
##    <chr>             <dbl> <chr>  <chr>    <chr>       <dbl>    <dbl>      <dbl>
##  1 Azusa Unified   6011282 Los A~ Public   LONGF~        255      227       2001
##  2 Baldwin Park ~  6011431 Los A~ Public   GEDDE~        166      146       2001
##  3 Castaic Union   6012033 Los A~ Public   CASTA~        196      188       2001
##  4 Compton Unifi~  6012355 Los A~ Public   AUGUS~        160      160       2001
##  5 Compton Unifi~  6012256 Los A~ Public   DICKI~        159      133       2001
##  6 Compton Unifi~  6012280 Los A~ Public   FOSTE~        178      150       2001
##  7 Compton Unifi~  6012306 Los A~ Public   KELLY~        226      217       2001
##  8 Compton Unifi~  6012314 Los A~ Public   KENNE~        154      129       2001
##  9 Compton Unifi~  6012389 Los A~ Public   THEOD~        263      263       2001
## 10 Downey Unified  6012744 Los A~ Public   ALAME~        191      171       2001
## # ... with 1,993 more rows

4.) Frequency Table

freqtableK5 <- table(LAK5_df$pub_priv)

freqtableK5
## 
## Private  Public 
##      10    1993
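
The counts are easier to compare as percentages. This sketch rebuilds the two counts shown above (10 private, 1,993 public) as a stand-in vector, since the full data frame is not loaded here, and converts the frequency table to shares with `prop.table()`. The names `pub_priv`, `freq`, and `share` are my own.

```r
# Stand-in for LAK5_df$pub_priv, rebuilt from the counts in the table above.
pub_priv <- c(rep("Private", 10), rep("Public", 1993))

# Frequency table, then each count divided by the total.
freq  <- table(pub_priv)
share <- prop.table(freq)

# Express the shares as percentages rounded to one decimal place.
pct <- round(100 * share, 1)
print(pct)
```

With the real data, `prop.table(freqtableK5)` would give the same shares directly.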

5.) Contingency Table

#table(LAK5_df$pub_priv, LAK5_df$school)

I did a treemap project with this data set a few weeks ago, where I found that most of the parents in San Bernardino County, California preferred private Montessori schools for their children’s education. For this project, I decided to look at the schools in Los Angeles County. I quickly learned that Los Angeles County is a very populous county, so I kept only the public and private school records with more than 150 kids enrolled, using the filter() function from dplyr.
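
The filtering step has two details worth seeing on a small example: a school with exactly 150 kids is dropped because the comparison is strict, and rows with a missing enrollment are dropped as well. The sketch below uses base R's `subset()` on made-up rows (`toy_df` and `la_large` are hypothetical names) so that it runs without extra packages; the dplyr form is noted in a comment.

```r
# Made-up rows covering the edge cases: exactly 150, another county, and NA.
toy_df <- data.frame(
  county     = c("Los Angeles", "Los Angeles", "Los Angeles", "Orange", "Los Angeles"),
  pub_priv   = c("Public", "Private", "Public", "Public", "Public"),
  enrollment = c(255, 40, 150, 300, NA)
)

# Keep Los Angeles rows with strictly more than 150 enrolled.
# dplyr form: filter(toy_df, county == "Los Angeles", enrollment > 150)
la_large <- subset(toy_df, county == "Los Angeles" & enrollment > 150)
print(la_large)
```

Only the first row survives: 150 is not greater than 150, the Orange County row fails the county test, and the `NA` comparison is treated as not matching.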

# pie() draws the chart and returns NULL, so there is nothing useful to assign
pie(table(LAK5_df$pub_priv))

California has 58 counties. After filtering for “Los Angeles” and “enrollment > 150” in my code, I was still left with 2,003 rows of data. I went on to create a frequency table of public vs. private schools in LA County. I did not create a useful contingency table because the dataset I chose for this project did not have enough suitable categorical variables to complete that part of the assignment. The two visualizations that I was able to create from this dataset, a pie chart and a bar graph, show that public schools account for almost all of the filtered rows (1,993 public vs. 10 private), which suggests that public school is either preferred or often the first choice in Los Angeles County.
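
One possible pairing for a contingency table, offered only as a sketch, is to treat the start year as a grouping variable and cross it with public vs. private. This was not part of the original analysis. The rows below are made up (`toy_df` and `ct` are hypothetical names), and the result on the real data would depend on how many private rows fall in each year.

```r
# Made-up rows: school type crossed with start year.
toy_df <- data.frame(
  pub_priv   = c("Public", "Public", "Private", "Public", "Private", "Public"),
  start_year = c(2001, 2001, 2001, 2002, 2002, 2002)
)

# Two-way table: rows are school type, columns are start year.
ct <- table(toy_df$pub_priv, toy_df$start_year,
            dnn = c("pub_priv", "start_year"))
print(ct)

# Add row and column totals for easier reading.
print(addmargins(ct))
```
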

LABarK5 <- barplot(table(LAK5_df$pub_priv),
  main = "Los Angeles Public and Private School Enrollment",
  xlab = "Private vs Public Enrollment",
  ylab = "Count",
  col = "blue",
  density = 10
)