1. Introduction

For this project, we will explore data from the California Department of Education in an effort to determine if California charter schools are better than California public schools.

We will attempt to evaluate this question by comparing the mean performance across charter schools and public schools across the following areas:

Suspension Rates
Graduation Rates
Dropout Rates
College Going Rates
Chronic Absenteeism Rates

For each of these areas, we will use a t-test to test the following hypothesis:

\(H_0: \mu_2 - \mu_1 = 0\)

\(H_a: \mu_2 - \mu_1 \ne 0\)

We will conduct these tests at a 95% level of significance.

Furthermore, once we have determined if there is a statistical signficance in the differences between the groups, we will aware a point to the school-type that has the better performance in this area, using the following convention:

Suspension Rates - For suspension rates, we will deem the school type that has the lower average suspension rate as being better

Graduation Rates - For graduation rates, we will deem the school type that has the higher average suspension rate as being better

Dropout Rates - For dropout rates, we will deem the school type that has the lower average dropout rate as being better

College Going Rates - For collage going rates, we will deem the school type that has the higher average college admission rate as being better

Chronic Absenteeism Rates- For chronic absenteeism rates, we will deem the school type that has the lower chronic absenteeism rate as being

Finally, the school type that has the most points will be considered the better school type.

A summary of our methodology is as follows:

Methodology:

Create dataframe by combining data files from the Department of Education website
Combine the various datasets into a singular data file
Calculate the average rate across all schools by school type
Compare the difference in the average rates to determine if they are statistically significant
If the difference is statistically significant, award a point to the school type wth the better average in the respective category
Determine which school type has the most point, and conclude that they are the better school-type

2. Import Data

We begin by importing the different data files from the California Department of Education website

Public Schools and Directory - Downloadable files containing general information about California’s public schools and districts.
Discipline File - Downloadable data about student discipline and the use of behavioral restraints and seclusion disaggregated by ethnicity, gender, program subgroup, and grade span.
Graduation and Dropout File - Downloadable data files of the Four-year Adjusted Cohort Graduation Rate (ACGR) and Outcome data reported by race/ethnicity, program subgroup, and gender.
Chronic Absenteeism File - Downloadable data files containing student absenteeism data by race/ethnicity, gender, program subgroup, and grade span. Chronic absenteeism counts, cumulative enrollment, and chronic absenteeism rate data are provided.
College-Going Rate for High School Completers File (16-month) - College-Going Rate (CGR) for California high school completers 16 months after high school completion reported by race/ethnicity, student group, and academic year.

schools_df_raw <- read_delim('https://www.cde.ca.gov/schooldirectory/report?rid=dl1&tp=txt')

## Rows: 18305 Columns: 47
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (47): CDSCode, NCESDist, NCESSchool, StatusType, County, District, Schoo...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

susp_df_raw <- read_delim('https://www3.cde.ca.gov/demo-downloads/discipline/suspension22-v2.txt')

## Rows: 225353 Columns: 21
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (20): AcademicYear, AggregateLevel, CountyCode, SchoolCode, CountyName, ...
## dbl  (1): DistrictCode
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

grads_df_raw<- read_delim('https://www3.cde.ca.gov/demo-downloads/acgr/acgr22-v2.txt')

## Rows: 254175 Columns: 34
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (33): AcademicYear, AggregateLevel, CountyCode, SchoolCode, CountyName, ...
## dbl  (1): DistrictCode
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

ca_df_raw<- read_delim('https://www3.cde.ca.gov/demo-downloads/attendance/chronicabsenteeism22-v2.txt')

## Rows: 264937 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (12): Academic Year, Aggregate Level, County Code, School Code, County N...
## dbl  (1): District Code
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

cgr_df_raw<- read_delim('https://www3.cde.ca.gov/demo-downloads/cgr/cgr16mo20.txt')

## Rows: 436467 Columns: 24
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (23): AcademicYear, AggregateLevel, CountyCode, SchoolCode, CountyName, ...
## dbl  (1): DistrictCode
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

#Make copies of dataframes

schools_df <- schools_df_raw
susp_df <- susp_df_raw
grads_df <- grads_df_raw
ca_df <- ca_df_raw
cgr_df <- cgr_df_raw

3. Data Wrangling

For this step, the focus is on preparing the different data frames so that they can be joined together and used downstream for our overall analysis.

a. Schools Directory

With the school’s directory data, we created a sub dataframe that only contains the columns we want. Additionally, we ensured that the cds_cods - the unique identifier for each school - is in the appropriate format for use when joining our other data frames.

schools_df <- clean_names(schools_df)

schools_mini <- schools_df %>% 
  select(cds_code, status_type, county, district, school, charter, eil_code)

schools_mini <- schools_mini %>% 
  mutate(cds_code = str_squish(format(as.numeric(cds_code),scientific=FALSE)))

b. Suspensions

For the suspension data, we focused on creating a sub dataframe that only contains school level data, and shows the suspension data for all students at each school. Additionally, we created a cds_code for each school, by combining the county code, district code, and school code for each school.

susp_df <- clean_names(susp_df)

susp_mini <- susp_df %>% select(aggregate_level, county_code, district_code, school_code,
                   reporting_category, suspension_rate_total)

susp_mini <- susp_mini %>% filter(aggregate_level == 'S',
                     reporting_category == 'TA')

susp_mini <- susp_mini %>% 
  mutate(cds_code = paste0(county_code, district_code, school_code)) %>% 
  mutate(cds_code = str_squish(format(as.numeric(cds_code), scientific = FALSE)))

susp_mini <- susp_mini %>% 
  select(cds_code, suspension_rate_total) %>%
  rename(suspension_rate = suspension_rate_total)

c. Graduates

For the Graduates data, we focused on creating a sub dataframe that only contains school level data, and shows the graduation and dropout data for all students at each school. Additionally, we created a cds_code for each school, by combining the county code, district code, and school code for each school.

grads_df <- clean_names(grads_df)

grads_mini <- grads_df %>% select(aggregate_level, county_code, district_code, school_code,
                    charter_school, dass, reporting_category, regular_hs_diploma_graduates_rate,
                    dropout_rate)

grads_mini <- grads_mini %>% 
  filter(aggregate_level == 'S', 
         dass == 'All', 
         reporting_category == 'TA', 
         charter_school=='All')

grads_mini <- grads_mini %>% 
  mutate(cds_code = paste0(county_code, district_code, school_code)) %>%
  mutate(cds_code = str_squish(format(as.numeric(cds_code),scientific=FALSE)))

grads_mini <- grads_mini %>% 
  select(cds_code, regular_hs_diploma_graduates_rate,dropout_rate) %>%
  rename(graduation_rate = regular_hs_diploma_graduates_rate)

d. Chronic Absenteeism

For the Chronic Absenteeism data, we focused on creating a sub dataframe that only contains school level data, and shows the chronic absenteeism data for all students at each school. Additionally, we created a cds_code for each school, by combining the county code, district code, and school code for each school.

ca_df <- clean_names(ca_df)

ca_mini <- ca_df %>% 
  select(aggregate_level, county_code, district_code, school_code, reporting_category,
         chronic_absenteeism_rate)

ca_mini <- ca_mini %>% 
  filter(aggregate_level == 'S', reporting_category == 'TA')

ca_mini <- ca_mini %>% 
  mutate(cds_code = paste0(county_code, district_code, school_code)) %>%
  mutate(cds_code = str_squish(format(as.numeric(cds_code),scientific=FALSE))) %>%
  select(cds_code, chronic_absenteeism_rate)

e. College Going Rate

Finally for the college going rates data, we focused on creating a sub dataframe that only contains school level data, and shows the college going rate data for all students at each school. Additionally, we created a cds_code for each school, by combining the county code, district code, and school code for each school.

cgr_df <- clean_names(cgr_df)

cgr_mini <- cgr_df %>% 
  select(aggregate_level, county_code, district_code, school_code, reporting_category, school_name,
         completer_type, alternative_school_accountability_status, college_going_rate_total_16_months,
         charter_school)

cgr_mini <- cgr_mini %>% filter(aggregate_level == 'S',
                                reporting_category == 'TA',
                                alternative_school_accountability_status == 'All',
                                completer_type == 'TA',
                                charter_school == 'All')

cgr_mini <- cgr_mini %>% 
  mutate(cds_code = paste0(county_code,district_code, school_code)) %>%
  mutate(cds_code = str_squish(format(as.numeric(cds_code),scientific=FALSE)))

cgr_mini <- cgr_mini %>% 
  select(cds_code, college_going_rate_total_16_months) %>%
  rename(college_going_rate = college_going_rate_total_16_months)

##. f Create joined data frame

Now that we have the necessary data from each file grouped into their separate sub dataframes, we are ready to combined the data into a single data frame that allows us to combine the performance data for each feature of interest with the data for the individual school.

Additionally, I relabeled the data associated with the school level that each school operates. This will make it easier to work with and to remove data for any grades that are not essential for our overall data.

merged_df <- left_join(schools_mini,susp_mini) %>%
  left_join(grads_mini) %>%
  left_join(ca_mini) %>%
  left_join(cgr_mini)

## Joining, by = "cds_code"
## Joining, by = "cds_code"
## Joining, by = "cds_code"
## Joining, by = "cds_code"

merged_df <- merged_df %>% 
  filter(charter != 'No Data')

merged_df <- merged_df %>% 
  mutate(eil_code = case_when(
    eil_code == 'A' ~ 'adult',
    eil_code == 'ELEM' ~ 'elementary',
    eil_code == 'ELEMHIGH' ~ 'elementary-high combo',
    eil_code == 'HS' ~ 'high school',
    eil_code == 'INTMIDJR' ~ 'intermediate/middle/junior high',
    eil_code == 'PS' ~ 'preschool',
    eil_code == 'UG' ~ 'ungraded')) %>%
  rename(school_level = eil_code)

4. Data Exploration

I began by exploring the overall merged data, to better understand the available information and to determine how the data is distributed between each school type

merged_df <- merged_df %>% 
  filter(school_level != 'adult', school_level != 'preschool', school_level != 'ungraded')


ggplot(merged_df) +
  geom_bar(aes(x=charter, fill=charter)) +
  ggtitle("Composition of Charter and Public Schools in California") +
  xlab("School Type") +
  ylab("Number of Schools")

merged_df %>% 
  count(charter) %>%
  kable(
    caption = "Number of Schools by Type" 
  ) %>%
  kable_material(c("striped"))

Number of Schools by Type
charter	n
N	11857
Y	2003

merged_df %>% 
  count(charter, school_level) %>% 
  spread(key=charter, value=n) %>%
  kable(
    caption = "Number of Schools by Type and School Level"
  ) %>%
  kable_material(c("striped"))

Number of Schools by Type and School Level
school_level	N	Y
elementary	6743	787
elementary-high combo	504	468
high school	2931	541
intermediate/middle/junior high	1679	207

merged_df %>% 
  filter(charter == 'Y') %>%
  group_by(school_level) %>%
  summarize(num_schools = n()) %>%
  mutate(
    pct_schools = round(num_schools/sum(num_schools),2)) %>% 
  kable(
    caption = "Number of Schools by School Level (Charter)",
    col.names = c("School Level", "N", "Pct of Total")
  ) %>%
  kable_material(c("striped"))

Number of Schools by School Level (Charter)
School Level	N	Pct of Total
elementary	787	0.39
elementary-high combo	468	0.23
high school	541	0.27
intermediate/middle/junior high	207	0.10

merged_df %>% 
  filter(charter == 'N') %>%
  group_by(school_level) %>%
  summarize(num_schools = n()) %>%
  mutate(
    pct_schools = round(num_schools/sum(num_schools),2)) %>% 
  kable(
    caption = "Number of Schools by School Level (Public)",
    col.names = c("School Level", "N", "Pct of Total")
  ) %>%
  kable_material(c("striped"))

Number of Schools by School Level (Public)
School Level	N	Pct of Total
elementary	6743	0.57
elementary-high combo	504	0.04
high school	2931	0.25
intermediate/middle/junior high	1679	0.14

merged_df <- merged_df %>% 
  mutate(suspension_rate = as.double(suspension_rate),
         graduation_rate = as.double(graduation_rate),
         dropout_rate = as.double(dropout_rate),
         chronic_absenteeism_rate = as.double(chronic_absenteeism_rate),
         college_going_rate = as.double(college_going_rate))

## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion

## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion

## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion

## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion

summary(merged_df)

##    cds_code         status_type           county            district        
##  Length:13860       Length:13860       Length:13860       Length:13860      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##     school            charter          school_level       suspension_rate 
##  Length:13860       Length:13860       Length:13860       Min.   : 0.000  
##  Class :character   Class :character   Class :character   1st Qu.: 0.200  
##  Mode  :character   Mode  :character   Mode  :character   Median : 1.200  
##                                                           Mean   : 3.062  
##                                                           3rd Qu.: 3.800  
##                                                           Max.   :87.500  
##                                                           NA's   :3855    
##  graduation_rate   dropout_rate     chronic_absenteeism_rate college_going_rate
##  Min.   :  0.00   Min.   :  0.000   Min.   :  0.00           Min.   : 0.00     
##  1st Qu.: 77.38   1st Qu.:  1.800   1st Qu.: 20.00           1st Qu.:33.30     
##  Median : 91.40   Median :  5.100   Median : 32.00           Median :57.10     
##  Mean   : 82.43   Mean   :  9.963   Mean   : 34.09           Mean   :53.89     
##  3rd Qu.: 96.00   3rd Qu.: 12.800   3rd Qu.: 45.10           3rd Qu.:73.50     
##  Max.   :100.00   Max.   :100.000   Max.   :100.00           Max.   :97.10     
##  NA's   :11564    NA's   :11564     NA's   :3955             NA's   :11844

merged_df %>%
  group_by(county, charter) %>%
  summarize(num_schools = n_distinct(cds_code)) %>% 
  spread(key=charter, value=num_schools) %>% 
  mutate(total_schools = N+Y) %>%
  arrange(desc(total_schools)) %>%
  ungroup() %>%
  kable(
    caption = "Number of Schools by County and School Type"
  ) %>%
  kable_material(c("striped"))

## `summarise()` has grouped output by 'county'. You can override using the
## `.groups` argument.

Number of Schools by County and School Type
county	N	Y	total_schools
Los Angeles	2249	536	2785
San Diego	810	218	1028
San Bernardino	661	87	748
Orange	700	43	743
Riverside	616	52	668
Alameda	480	92	572
Santa Clara	463	85	548
Sacramento	459	78	537
Fresno	413	72	485
Kern	334	21	355
San Joaquin	280	73	353
Contra Costa	320	29	349
Sonoma	186	79	265
Stanislaus	220	42	262
Tulare	237	25	262
Ventura	241	18	259
San Mateo	212	25	237
Monterey	181	18	199
Santa Barbara	154	24	178
San Francisco	153	23	176
Shasta	138	35	173
Placer	139	32	171
Humboldt	124	21	145
Solano	131	14	145
Butte	117	24	141
Merced	137	4	141
Madera	111	11	122
San Luis Obispo	117	4	121
Santa Cruz	99	17	116
Mendocino	96	16	112
El Dorado	81	21	102
Marin	91	5	96
Nevada	61	34	95
Yolo	82	11	93
Imperial	82	5	87
Kings	68	18	86
Siskiyou	78	8	86
Napa	67	2	69
Tehama	55	7	62
Sutter	48	13	61
Lake	55	4	59
Yuba	45	13	58
Tuolumne	49	6	55
Glenn	43	5	48
Lassen	39	9	48
Inyo	38	5	43
Calaveras	36	1	37
Trinity	32	3	35
San Benito	31	1	32
Mono	24	5	29
Plumas	28	1	29
Mariposa	24	1	25
Modoc	24	1	25
Del Norte	20	4	24
Amador	19	1	20
Sierra	15	1	16
Alpine	14	NA	NA
Colusa	30	NA	NA

merged_df %>%
  filter(charter == 'Y') %>%
  group_by(school_level) %>%
  summarize(n= n()) %>%
  ggplot() +
  geom_bar(aes(x=reorder(school_level, n), y=n, fill=school_level), stat='identity') +
  coord_flip() +
  ggtitle("Composition of Charter Schools by School Level") +
  xlab("") +
  ylab("")

merged_df %>%
  filter(charter == 'N') %>%
  group_by(school_level) %>%
  summarize(n= n()) %>%
  ggplot() +
  geom_bar(aes(x=reorder(school_level, n), y=n, fill=school_level), stat='identity') +
  coord_flip() +
  ggtitle("Composition of Public Schools by School Level") +
  xlab("") +
  ylab("")

5. Data Analysis

a. Suspension Rate

For suspension rate we will use a t-test to determine if the difference in the average school suspension rate between charter schools and public schools is statistically significant at the 95% level of significance

m1 <- merged_df %>% 
  select(charter, status_type, school_level, suspension_rate) %>%
  filter(!is.na(suspension_rate))
  

m1_summary <- m1 %>% 
  group_by(charter) %>% 
  summarize(num_schools = n(),
            avg = round(mean(as.double(suspension_rate)),2),
            sd = round(sd(as.double(suspension_rate)),2))

ggplot(m1,aes(x=suspension_rate,fill=charter)) +
  geom_density(alpha=0.3) +
  ggtitle("Distribution of School Suspension Rates by School Type") +
  xlab("School Suspension Rate") +
  ylab("Pct of Population")

ggplot(m1) +
  geom_boxplot(aes(x=charter, y=suspension_rate, fill=charter)) +
  ggtitle("Distribution of School Suspension Rates by School Type") +
  xlab("School Type") +
  ylab("Suspension Rates")

m1_summary %>%
  kable(
    caption = "Suspension Rates by School Type",
    col.names = c("Charter School", "N", "Sample Mean", "Sample Std. Dev")
  ) %>%
  kable_material(c("striped"))

Suspension Rates by School Type
Charter School	N	Sample Mean	Sample Std. Dev
N	8720	3.26	5.61
Y	1285	1.73	2.93

sr_charter <- (m1 %>% filter(charter == 'Y'))$suspension_rate
sr_public <- (m1 %>% filter(charter == 'N'))$suspension_rate

t.test(x=sr_charter, y=sr_public, alternative="two.sided",conf.level=0.95)

## 
##  Welch Two Sample t-test
## 
## data:  sr_charter and sr_public
## t = -15.016, df = 2922.7, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.722124 -1.324312
## sample estimates:
## mean of x mean of y 
##  1.734397  3.257615

suspension_rate = c(1,0)

At the 95% level of significance, we have enough evidence to reject the null hypothesis that there is no difference in the average suspension rates between charter schools and pubic schools, and therefore we accept the alternative hypothesis that there is a difference in the suspensions rates between charter schools and public schools.

Furthermore, based on our data, we conclude that the suspension rate for charter schools is lower, and therefore better than the suspension rate at public schools. Therefore we conclude that Charter schools perform better than public schools with respect to suspension rates.

b. Graduation Rate

For graduation rate we will use a t-test to determine if the difference in the average school graduation rates between charter schools and public schools is statistically significant at the 95% level of significance

m2 <- merged_df %>% 
  select(charter, status_type, school_level, graduation_rate)


m2 <- m2 %>% 
  filter(school_level != 'adult',
         school_level != 'preschool',
         school_level != 'ungraded')


m2 <- m2 %>% 
  filter(!is.na(graduation_rate), 
         str_detect(graduation_rate, regex('[[:alnum:]]'))) %>%
  mutate(graduation_rate = as.double(graduation_rate))

m2_summary <- m2 %>% 
  group_by(charter) %>% 
  summarize(n = n(),
            avg = round(mean(graduation_rate),2),
            sd = round(sd(graduation_rate),2))

ggplot(m2,aes(x=graduation_rate, fill=charter)) +
  geom_density(alpha=0.3) +
  ggtitle("Distribution of School Graduation Rates by School Type") +
  xlab("School Graduation Rate") +
  ylab("Pct of Population")

ggplot(m2) +
  geom_boxplot(aes(x=charter, y=graduation_rate, fill=charter)) +
  ggtitle("Distribution of School Graduation Rates by School Type") +
  xlab("School Type") +
  ylab("Graduation Rate")

m2_summary %>%
  kable(
    caption = "Graduation Rates by School Type",
    col.names = c("Charter School", "N", "Sample Mean", "Sample Std. Dev")
  ) %>%
  kable_material(c("striped"))

Graduation Rates by School Type
Charter School	N	Sample Mean	Sample Std. Dev
N	1759	83.43	20.07
Y	537	79.18	25.65

gr_charter <- (m2 %>% filter(charter == 'Y'))$graduation_rate
gr_public <- (m2 %>% filter(charter == 'N'))$graduation_rate

t.test(x=gr_charter, y=gr_public, alternative="two.sided",conf.level=0.95)

## 
##  Welch Two Sample t-test
## 
## data:  gr_charter and gr_public
## t = -3.5241, df = 747.07, p-value = 0.0004507
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -6.616605 -1.882264
## sample estimates:
## mean of x mean of y 
##  79.17598  83.42541

graduation_rate = c(0,1)

At the 95% level of significance, we have enough evidence to reject the null hypothesis that there is no difference in the graduation rates between charter schools and pubic schools, and therefore we accept the alternative hypothesis that there is a difference in the average graduation rates between charter schools and public schools.

Furthermore, based on our data, we conclude that the graduation rate for charter schools is lower, and therefore worse than than the graduation rate at public schools. Therefore, we conclude that Public schools perform better than charter schools with respect to graduation rates.

c. Dropout Rate

For dropout rate we will use a t-test to determine if the difference in the average school dropout rate between charter schools and public schools is statistically significant at the 95% level of significance

m3 <- merged_df %>% 
  select(charter, status_type, school_level, dropout_rate)

m3 %>% 
  count(charter, school_level) %>%
  spread(key=charter, value = n) %>%
  kable() %>%
  kable_material(c("striped"))

school_level	N	Y
elementary	6743	787
elementary-high combo	504	468
high school	2931	541
intermediate/middle/junior high	1679	207

m3 <- m3 %>% 
  filter(school_level != 'adult',
         school_level != 'preschool',
         school_level != 'ungraded')

m3 <- m3 %>% 
  filter(!is.na(dropout_rate), 
         str_detect(dropout_rate, regex('[[:alnum:]]'))) %>%
  mutate(dropout_rate = as.double(dropout_rate))

ggplot(m3,aes(x=dropout_rate, fill=charter)) +
  geom_density(alpha=0.3) + 
  ggtitle("Distribution of School Dropout Rates by School Type") +
  xlab("School Dropout Rate") +
  ylab("Pct of Population")

ggplot(m3) +
  geom_boxplot(aes(x=charter, y=dropout_rate, fill=charter)) +
  ggtitle("Distribution of School Dropout Rates by School Type") +
  xlab("School Type") +
  ylab("Dropout Rate")

m3_summary <- m3 %>% 
  group_by(charter) %>%
  summarize(n = n(),
            avg = round(mean(dropout_rate),2),
            sd = round(sd(dropout_rate),2))

m3_summary %>%
  kable(
    caption = "Dropout Rates by School Type",
    col.names = c("Charter School", "N", "Sample Mean", "Sample Std. Dev")
  ) %>%
  kable_material(c("striped"))

Dropout Rates by School Type
Charter School	N	Sample Mean	Sample Std. Dev
N	1759	9.27	11.90
Y	537	12.22	16.64

dr_charter <- (m3 %>% filter(charter == 'Y'))$dropout_rate
dr_public <- (m3 %>% filter(charter == 'N'))$dropout_rate

t.test(x=dr_charter, y=dr_public, alternative="two.sided",conf.level=0.95)

## 
##  Welch Two Sample t-test
## 
## data:  dr_charter and dr_public
## t = 3.8232, df = 711.11, p-value = 0.0001433
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  1.435686 4.466676
## sample estimates:
## mean of x mean of y 
## 12.223836  9.272655

dropout_rate = c(0,1)

At the 95% level of significance, we have enough evidence to reject the null hypothesis that there is no difference in the average dropout rates between charter schools and pubic schools, and therefore we accept the alternative hypothesis that there is a difference in the average dropout rates between charter schools and public schools.

Furthermore, based on our data, we conclude that the dropout rate for public schools is lower, and therefore better than the dropout rate at charter schools. Therefore we conclude that Public schools perform better than charter schools with respect to dropout rates.

d. College Going Rate

For college going rate we will use a t-test to determine if the difference in the average school college going rate between charter schools and public schools (high school and elementary-high combo schools) is statistically significant at the 95% level of significance

m4 <- merged_df %>% 
  select(charter, status_type, school_level, college_going_rate)


m4 <- m4 %>% 
  filter(school_level %in% c('high school', 
                             'elementary-high combo'))

m4 <- m4 %>% 
  filter(!is.na(college_going_rate), 
         str_detect(college_going_rate, regex('[[:alnum:]]'))) %>%
  mutate(college_going_rate = as.double(college_going_rate))

m4_summary <- m4 %>% 
  group_by(charter) %>%
  summarize(n = n(),
            avg = round(mean(college_going_rate),2),
            sd = round(sd(college_going_rate),2))


ggplot(m4,aes(x=college_going_rate, fill=charter)) +
  geom_density(alpha=0.3) +
  ggtitle("Distribution of School College Going Rates by School Type") +
  xlab("School Dropout Rate") +
  ylab("Pct of Population")

ggplot(m4) +
  geom_boxplot(aes(x=charter, y=college_going_rate, fill=charter)) +
  ggtitle("Distribution of School College Going Rates by School Type") +
  xlab("School Type") +
  ylab("College Going Rate")

m4_summary %>%
  kable(
    caption = "College Going Rates by School Type",
    col.names = c("Charter School", "N", "Sample Mean", "Sample Std. Dev")
  ) %>%
  kable_material(c("striped"))

College Going Rates by School Type
Charter School	N	Sample Mean	Sample Std. Dev
N	1535	54.02	24.48
Y	480	53.47	22.28

cgr_charter <- (m4 %>% filter(charter == 'Y'))$college_going_rate
cgr_public <- (m4 %>% filter(charter == 'N'))$college_going_rate

t.test(x=cgr_charter, y=cgr_public, alternative="two.sided",conf.level=0.95)

## 
##  Welch Two Sample t-test
## 
## data:  cgr_charter and cgr_public
## t = -0.45707, df = 870.08, p-value = 0.6477
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.888105  1.797037
## sample estimates:
## mean of x mean of y 
##  53.47479  54.02033

college_going_rate = c(0,0)

At the 95% level of significance, we are unable to reject the null hypothesis, and therefore we conclude that there is no difference between Charter schools and Public Schools with respect to college going rates

e. Chronic Absenteeism Rate

Finally, for chronic absenteeism rate we will use a t-test to determine if the difference in the average school chronic absenteeism rate between charter schools and public schools is statistically significant at the 95% level of significance

m5 <- merged_df %>% 
  select(charter, status_type, school_level, chronic_absenteeism_rate)

m5 %>% 
  count(charter, school_level) %>%
  spread(key=charter, value = n) %>%
  kable() %>%
  kable_material(c("striped"))

school_level	N	Y
elementary	6743	787
elementary-high combo	504	468
high school	2931	541
intermediate/middle/junior high	1679	207

m5 <- m5 %>% 
  filter(school_level != 'adult',
         school_level != 'preschool',
         school_level != 'ungraded')

m5 <- m5 %>% 
  filter(!is.na(chronic_absenteeism_rate), 
         str_detect(chronic_absenteeism_rate, regex('[[:alnum:]]'))) %>%
  mutate(chronic_absenteeism_rate = as.double(chronic_absenteeism_rate))

m5_summary <- m5 %>% 
  group_by(charter) %>%
  summarize(n = n(),
            avg = round(mean(chronic_absenteeism_rate),2),
            sd = round(sd(chronic_absenteeism_rate),2))

ggplot(merged_df,aes(x=chronic_absenteeism_rate, fill=charter)) +
  geom_density(alpha=0.3) +
  ggtitle("Distribution of Chronic Absenteeism Rates by School Type") +
  xlab("School Chronic Absenteeism Rate") +
  ylab("Pct of Population")

## Warning: Removed 3955 rows containing non-finite values (`stat_density()`).

ggplot(merged_df) +
  geom_boxplot(aes(x=charter, y=chronic_absenteeism_rate, fill=charter)) +
  ggtitle("Distribution of Chronic Absenteeism Rates by School Type") +
  xlab("School Type") +
  ylab("Chronic Absenteeism Rate")

## Warning: Removed 3955 rows containing non-finite values (`stat_boxplot()`).

m5_summary %>%
  kable(
    caption = "Chronic Absenteeism Rate by School Type",
    col.names = c("Charter School", "N", "Sample Mean", "Sample Std. Dev")
  ) %>%
  kable_material(c("striped"))

Chronic Absenteeism Rate by School Type
Charter School	N	Sample Mean	Sample Std. Dev
N	8622	35.15	19.37
Y	1283	26.95	18.41

ca_charter <- (m5 %>% filter(charter == 'Y'))$chronic_absenteeism_rate
ca_public <- (m5 %>% filter(charter == 'N'))$chronic_absenteeism_rate

t.test(x=ca_charter, y=ca_public, alternative="two.sided",conf.level=0.95)

## 
##  Welch Two Sample t-test
## 
## data:  ca_charter and ca_public
## t = -14.78, df = 1732.3, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -9.287698 -7.111556
## sample estimates:
## mean of x mean of y 
##  26.94981  35.14943

chronic_absenteeism_rate = c(1,0)

At the 95% level of significance, we have enough evidence to reject the null hypothesis that there is no difference in the average chronic absenteeism rates between charter schools and pubic schools, and therefore we accept the alternative hypothesis that there is a difference in the average chronic absenteeism rates between charter schools and public schools.

Furthermore, based on our data, we conclude that the chronic absenteeism rate for charter schools is lower, and therefore better than the chronic absenteeism rate at public schools. Thus, we conclude that charter schools perform better than public schools with respect to chronic absenteeism.

6. Summary Data and Conclusion

Across each of the categories that we used to compare the performance of charter schools vs. public schools, we find that charter schools outperformed public schools in 2 of the 5 categories, while public schools outperformed charter schools in 2 of 5 categories.

category <- c("suspension_rate", "graduation_rate", "dropout_rate", "college_going_rate", 
              "chronic_absenteeism_rate")
charter <- c(1,0,0,0,1)
public <- c(0,1,1,0,0)

summary_tab <- data.frame(category, charter, public)



summary_tab %>%
  kable(
    caption = 'Charter School vs. Public Schools',
    col.names = (c("Category", "Charter", "Public"))
  ) %>%
  kable_material(c("striped"))

Charter School vs. Public Schools
Category	Charter	Public
suspension_rate	1	0
graduation_rate	0	1
dropout_rate	0	1
college_going_rate	0	0
chronic_absenteeism_rate	1	0

Based on these results, we were unable to support our assumption that charter schools are better than public schools in California.

However, there are several known limitations with using this analysis to determine if charter schools are better than public schools. Some of these limitations include:

Based on the way we summarized the results from each of the different categories of analysis, we essentially are concluding that each category is equally-weighted for the purposes of calculating our overall ranking. While it’s possible that an individual interested in comparing the two school types may agree that each of these categories are qualitatively the same in terms of their choice, additional research should be done to come to this conclusion.
Additionally, there are a number of other features that can and should be used to measure the differences between public and charter schools. Amongst them include: performance on standardized tests, extra-curriculur activities, diversity of staff, student diversity. Although one could argue that the performance in each of these other areas would show up in areas such as graduation rates and college-going rates, it’s still worth noting that they are absent in our evaluation tool.
Finally, this analysis was based on relative performance between school types for all student. However, this doesn’t take into account the fact that one of the missions of charter schools is often focused on beting a better option for undermarginalized populations. As a result, it’s possible that if we were to further evaluate performance amongst different sub-cohorts of student populations, we may generate a different conclusion.

Overall, I believe that this research project allows us to recognize that there are no clear cut answers to the question of whether or not charter schools are better than public schools.

Appendix:

The documentation for each of the datasets are located here

Public Schools and Districts - https://www.cde.ca.gov/ds/si/ds/fspubschls.asp
Chronic Absenteeism - https://www.cde.ca.gov/ds/ad/fsabd.asp
Suspensions - https://www.cde.ca.gov/ds/ad/fssd.asp
Dropout - https://www.cde.ca.gov/ds/ad/fsacgr.asp
Post-Secondary Enrollment - https://www.cde.ca.gov/ds/ad/fscgr16.asp

DATA 606 - Final Project

Kory Martin

2023-05-13