You get to decide which dataset you want to work on. THe data set must be different You can work on a problem from your work, or something you are interested in. You may also obtain a dataset from sites such as Kaggle, Data.Gov, Census Bureau, USGS or other open data portal. Select one of the methodologies studied in weeks 1-10, and one methodology from weeks 11-15 to apply in the new dataset selected. To complete this task:. - describe the problem you are trying to solve. - describe your datasets and what you did to prepare the data for analysis. - methodologies you used for analyzing the data - why you did what you did - make your conclusions from your analysis. Please be sure to address the business impact (it could be of any domain) of your solution. Your final presentation could be the traditional R file and essay, or it could be an oral presentation with the execution and explanation of your code, recorded on any platform of your choice (Youtube, Free Cam). If you select the presentation, it should be a 5 to 8 minutes recording.
The topic of selecting a middle school was a selfish one for me because my child is set to attend middle school in New York City. NYC’s public school system is difficult to navigate and as a parent I sought to select a middle school with an increased likelihood of students attending a specialized high school. The main problem I wish to solve is to select a middle school for my child to attend with an increased likelihood of students attending a specialized high school.
Every year over 25,000 eighth grade students in New York City’s public school system take the arduous Specialized High School Admissions Test (SHSAT) to gain admittance to eight prestigious specialized high schools: Bronx High School of Science, Brooklyn Latin School, Brooklyn Technical High School, the High School for Mathematics, Science and Engineering at City College of New York, High School of American Studies at Lehman College, Queens High School for the Sciences at York College, Staten Island Technical High School, and Stuyvesant High School. Yet with a population of over 1 million students, the largest system in the United States, is there equity in the system to allow admittance? It is the plan of this research to look at the data for New York City’s school system for the years of 2018 to 2021 with regards to middle schools and the SHSAT. The system is divided into 32 school districts. This study will examine middle school data to see if there is a correlation between the number of offers given for specialized high school based on the SHSAT and an individual school’s academic data, programs, and district assignment. It is the hope of this study to identify trends for improving success for schools and to, if any, to determine ways to improve schools that have few offers.
The following data sets were used for this homework.
Looking at the NYC Open Data, the data for middle schools proved to
only be viable for 2018 to 2021. In addition, the data sources would
report a range of 0-5 for offers. Hence, if a middle school received 0
offers for students, the row would have a value of “0 to 5”. I decided
to list such values as 0 in addition to removing middle schools with
“NA” values for student counts: the reason is that in some years the
middle school may not have reported the data or did not exist.
Later in the exploration of models, I found that issues with
overdispersion and would need to employ zero-inflation modelling
methods.
In visualizing and exploring the data, I wanted to be able to identify high performing middle schools by examining them on a map and plotting their SHSAT offers (offers made by specialized high schools to students). By also creating some feature variables, I would examine a correlation matrix for the variables and see to identify what makes a good middle school for producing students with a high probability of being given an offer by a specialized high school. Next, I would use different models, a decision tree, multinomial linear regression, and knn clustering to see if there are possible policies that can be implemented to increase the likelihood of students receiving offers for specialized high schools.
For the exploratory data analysis, I decided early to create maps
that would help me grapple with the location of middle schools and their
SHSAT performance. The following were coded in Python and R using
Plotly and Shiny libraries.
***
The following code imports that data sets that were created after cleaning.
# School information: Note: only 2018-2021 are available
school_info <- read_csv('https://raw.githubusercontent.com/logicalschema/spring2022/main/data622/hw4/data/2018-2021_school_information.csv')
## Rows: 1906 Columns: 17
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (11): schooldbn, name, neighborhood, address, totalstudents, ellprograms...
## dbl (6): district, coursepassrate, elaprof, mathprof, surveysafety, year
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Remove NA values for total_students, course pass rate, elaprof, and mathprof
school_info <- school_info %>% drop_na(totalstudents)
school_info <- school_info %>% drop_na(coursepassrate)
school_info <- school_info %>% drop_na(elaprof)
school_info <- school_info %>% drop_na(mathprof)
# Convert totalstudents to numberical
school_info$totalstudents <- as.numeric(as.character(school_info$totalstudents))
## Warning: NAs introduced by coercion
# Potential features
## Dual Program feature: If the school has a dual language program
school_info$dual_program <- ifelse(str_detect(school_info$ellprograms, "Dual") & (is.na(school_info$ellprograms) == FALSE), 1, 0)
## Specialized Test Prep feature: If the school has a specialized high school prep class
school_info$shsat_prep <- ifelse(str_detect(school_info$electiveclasses, "Specialized High School Test") & (is.na(school_info$electiveclasses) == FALSE), 1, 0)
# School offers have the offers for middle schools: Note 2016-2021 are available but for the previous data set only 2018-2021 are available
# Remove rows with NA values
school_offers <- read_csv('https://raw.githubusercontent.com/logicalschema/spring2022/main/data622/hw4/data/school_offers.csv')
## Rows: 470 Columns: 28
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (6): dbn, name, telephone, address, Borough, url
## dbl (22): district, Postcode, Latitude, Longitude, 2016_student_count, 2016_...
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
school_offers$Postcode <- as.character(school_offers$Postcode)
school_offers <- subset(school_offers, select = -c(`name`, `telephone`, `address`, `2016_student_count`, `2016_testers_count`,`2016_offers_count`,`2017_student_count`,`2017_testers_count`,`2017_offers_count`))
school_offers <- na.omit(school_offers)
temp1 <- subset(school_info, select = c(`district`, `schooldbn`, `name`, `year`, `coursepassrate`, `accessibility`, `elaprof`, `mathprof`, `totalstudents`, `surveysafety`,`dual_program`, `shsat_prep`))
temp2018 <- merge(temp1[temp1$year == 2018, ], subset(school_offers, select = c(`dbn`, `2018_student_count`, `2018_testers_count`, `2018_offers_count`)), by.x = 'schooldbn', by.y='dbn')
names(temp2018)[names(temp2018) == '2018_student_count'] <- 'student_count'
names(temp2018)[names(temp2018) == '2018_testers_count'] <- 'testers_count'
names(temp2018)[names(temp2018) == '2018_offers_count'] <- 'offers_count'
temp2019 <- merge(temp1[temp1$year == 2019, ], subset(school_offers, select = c(`dbn`, `2019_student_count`, `2019_testers_count`, `2019_offers_count`)), by.x = 'schooldbn', by.y='dbn')
names(temp2019)[names(temp2019) == '2019_student_count'] <- 'student_count'
names(temp2019)[names(temp2019) == '2019_testers_count'] <- 'testers_count'
names(temp2019)[names(temp2019) == '2019_offers_count'] <- 'offers_count'
temp2020 <- merge(temp1[temp1$year == 2020, ], subset(school_offers, select = c(`dbn`, `2020_student_count`, `2020_testers_count`, `2020_offers_count`)), by.x = 'schooldbn', by.y='dbn')
names(temp2020)[names(temp2020) == '2020_student_count'] <- 'student_count'
names(temp2020)[names(temp2020) == '2020_testers_count'] <- 'testers_count'
names(temp2020)[names(temp2020) == '2020_offers_count'] <- 'offers_count'
temp2021 <- merge(temp1[temp1$year == 2021, ], subset(school_offers, select = c(`dbn`, `2021_student_count`, `2021_testers_count`, `2021_offers_count`)), by.x = 'schooldbn', by.y='dbn')
names(temp2021)[names(temp2021) == '2021_student_count'] <- 'student_count'
names(temp2021)[names(temp2021) == '2021_testers_count'] <- 'testers_count'
names(temp2021)[names(temp2021) == '2021_offers_count'] <- 'offers_count'
# Create model data and concatenate the temp data frames
model_data <- rbind(temp2018, temp2019)
model_data <- rbind(model_data, temp2020)
model_data <- rbind(model_data, temp2021)
# Create rate_offers: number of offers / number of testers
model_data$rate_offers <- model_data$offers_count / model_data$testers_count
model_data$rate_offers[is.na(model_data$rate_offers)] <- 0
# Temp variables for schools
schools2018 <- model_data[model_data$year == 2018, ]
schools2019 <- model_data[model_data$year == 2019, ]
schools2020 <- model_data[model_data$year == 2020, ]
schools2021 <- model_data[model_data$year == 2021, ]
The following tabs represent a visual analysis of the data for the years 2018 to 2021. The top 25 middle schools with the highest rate of offers to specialized high schools for each year is also listed. Notable schools such as the Christa McAuliffe School would consistently be at the top for high performing middle schools.
print(dfSummary(schools2018, graph.magnif=0.75), method="render")
| No | Variable | Stats / Values | Freqs (% of Valid) | Graph | Valid | Missing | |||||||||||||||||||||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | schooldbn [character] |
|
|
207 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 2 | district [numeric] |
|
16 distinct values | 207 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 3 | name [character] |
|
|
207 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 4 | year [numeric] | 1 distinct value |
|
207 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 5 | coursepassrate [numeric] |
|
33 distinct values | 207 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 6 | accessibility [character] |
|
|
207 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 7 | elaprof [numeric] |
|
73 distinct values | 207 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 8 | mathprof [numeric] |
|
79 distinct values | 207 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 9 | totalstudents [numeric] |
|
172 distinct values | 207 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 10 | surveysafety [numeric] |
|
40 distinct values | 207 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 11 | dual_program [numeric] |
|
|
207 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 12 | shsat_prep [numeric] |
|
|
207 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 13 | student_count [numeric] |
|
130 distinct values | 207 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 14 | testers_count [numeric] |
|
92 distinct values | 207 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 15 | offers_count [numeric] |
|
39 distinct values | 207 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 16 | rate_offers [numeric] |
|
62 distinct values | 207 (100.0%) | 0 (0.0%) |
Generated by summarytools 1.0.0 (R version 4.1.3)
2022-05-04
ggplot( head(arrange(schools2018, desc(rate_offers)), n = 25), aes(x= reorder(name, rate_offers), y=rate_offers) ) +
geom_bar(stat = "identity", fill="#0033a1") +
coord_flip() +
theme_bw()
p1 <- ggplot(data=model_data[model_data$year == 2018, ], aes(x=coursepassrate, y=rate_offers)) + geom_point(alpha=0.5) + geom_jitter()
p2 <- ggplot(data=model_data[model_data$year == 2018, ], aes(x=elaprof, y=rate_offers)) + geom_point(alpha=0.5) + geom_jitter()
p3 <- ggplot(data=model_data[model_data$year == 2018, ], aes(x=mathprof, y=rate_offers)) + geom_point(alpha=0.5) + geom_jitter()
p4 <- ggplot(data=model_data[model_data$year == 2018, ], aes(x=surveysafety, y=rate_offers)) + geom_point(alpha=0.5) + geom_jitter()
p5 <- ggplot(data=model_data[model_data$year == 2018, ], aes(x=accessibility, y=rate_offers)) + geom_boxplot() + geom_boxplot(outlier.colour = "red")
multiplot(p1, p2, p3, p4, p5, cols=3)
# Cor plots
ggcorr(schools2021[, c(5,7,8,9,10,11,12,13,14,15,16)], label = T, hjust= 0.9)
print(dfSummary(schools2019, graph.magnif=0.75), method="render")
| No | Variable | Stats / Values | Freqs (% of Valid) | Graph | Valid | Missing | |||||||||||||||||||||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | schooldbn [character] |
|
|
455 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 2 | district [numeric] |
|
32 distinct values | 455 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 3 | name [character] |
|
|
455 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 4 | year [numeric] | 1 distinct value |
|
455 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 5 | coursepassrate [numeric] |
|
41 distinct values | 455 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 6 | accessibility [character] |
|
|
454 (99.8%) | 1 (0.2%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 7 | elaprof [numeric] |
|
88 distinct values | 455 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 8 | mathprof [numeric] |
|
96 distinct values | 455 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 9 | totalstudents [numeric] |
|
362 distinct values | 454 (99.8%) | 1 (0.2%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 10 | surveysafety [numeric] |
|
44 distinct values | 455 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 11 | dual_program [numeric] |
|
|
455 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 12 | shsat_prep [numeric] |
|
|
455 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 13 | student_count [numeric] |
|
217 distinct values | 455 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 14 | testers_count [numeric] |
|
131 distinct values | 455 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 15 | offers_count [numeric] |
|
51 distinct values | 455 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 16 | rate_offers [numeric] |
|
111 distinct values | 455 (100.0%) | 0 (0.0%) |
Generated by summarytools 1.0.0 (R version 4.1.3)
2022-05-04
ggplot( head(arrange(schools2019, desc(rate_offers)), n = 25), aes(x= reorder(name, rate_offers), y=rate_offers) ) +
geom_bar(stat = "identity", fill="#0033a1") +
coord_flip() +
theme_bw()
p1 <- ggplot(data=model_data[model_data$year == 2019, ], aes(x=coursepassrate, y=rate_offers)) + geom_point(alpha=0.5) + geom_jitter()
p2 <- ggplot(data=model_data[model_data$year == 2019, ], aes(x=elaprof, y=rate_offers)) + geom_point(alpha=0.5) + geom_jitter()
p3 <- ggplot(data=model_data[model_data$year == 2019, ], aes(x=mathprof, y=rate_offers)) + geom_point(alpha=0.5) + geom_jitter()
p4 <- ggplot(data=model_data[model_data$year == 2019, ], aes(x=surveysafety, y=rate_offers)) + geom_point(alpha=0.5) + geom_jitter()
p5 <- ggplot(data=model_data[model_data$year == 2019, ], aes(x=accessibility, y=rate_offers)) + geom_boxplot() + geom_boxplot(outlier.colour = "red")
multiplot(p1, p2, p3, p4, p5, cols=3)
# Cor plots
ggcorr(schools2021[, c(5,7,8,9,10,11,12,13,14,15,16)], label = T, hjust= 0.9)
print(dfSummary(schools2020, graph.magnif=0.75), method="render")
| No | Variable | Stats / Values | Freqs (% of Valid) | Graph | Valid | Missing | |||||||||||||||||||||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | schooldbn [character] |
|
|
456 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 2 | district [numeric] |
|
32 distinct values | 456 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 3 | name [character] |
|
|
456 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 4 | year [numeric] | 1 distinct value |
|
456 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 5 | coursepassrate [numeric] |
|
41 distinct values | 456 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 6 | accessibility [character] |
|
|
456 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 7 | elaprof [numeric] |
|
84 distinct values | 456 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 8 | mathprof [numeric] |
|
91 distinct values | 456 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 9 | totalstudents [numeric] |
|
366 distinct values | 456 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 10 | surveysafety [numeric] |
|
42 distinct values | 456 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 11 | dual_program [numeric] |
|
|
456 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 12 | shsat_prep [numeric] | 1 distinct value |
|
456 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 13 | student_count [numeric] |
|
210 distinct values | 456 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 14 | testers_count [numeric] |
|
133 distinct values | 456 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 15 | offers_count [numeric] |
|
48 distinct values | 456 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 16 | rate_offers [numeric] |
|
104 distinct values | 456 (100.0%) | 0 (0.0%) |
Generated by summarytools 1.0.0 (R version 4.1.3)
2022-05-04
ggplot( head(arrange(schools2020, desc(rate_offers)), n = 25), aes(x= reorder(name, rate_offers), y=rate_offers) ) +
geom_bar(stat = "identity", fill="#0033a1") +
coord_flip() +
theme_bw()
p1 <- ggplot(data=model_data[model_data$year == 2020, ], aes(x=coursepassrate, y=rate_offers)) + geom_point(alpha=0.5) + geom_jitter()
p2 <- ggplot(data=model_data[model_data$year == 2020, ], aes(x=elaprof, y=rate_offers)) + geom_point(alpha=0.5) + geom_jitter()
p3 <- ggplot(data=model_data[model_data$year == 2020, ], aes(x=mathprof, y=rate_offers)) + geom_point(alpha=0.5) + geom_jitter()
p4 <- ggplot(data=model_data[model_data$year == 2020, ], aes(x=surveysafety, y=rate_offers)) + geom_point(alpha=0.5) + geom_jitter()
p5 <- ggplot(data=model_data[model_data$year == 2020, ], aes(x=accessibility, y=rate_offers)) + geom_boxplot() + geom_boxplot(outlier.colour = "red")
multiplot(p1, p2, p3, p4, p5, cols=3)
# Cor plots
ggcorr(schools2021[, c(5,7,8,9,10,11,12,13,14,15,16)], label = T, hjust= 0.9)
print(dfSummary(schools2021, graph.magnif=0.75), method="render")
| No | Variable | Stats / Values | Freqs (% of Valid) | Graph | Valid | Missing | |||||||||||||||||||||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | schooldbn [character] |
|
|
458 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 2 | district [numeric] |
|
32 distinct values | 458 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 3 | name [character] |
|
|
458 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 4 | year [numeric] | 1 distinct value |
|
458 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 5 | coursepassrate [numeric] |
|
41 distinct values | 458 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 6 | accessibility [character] |
|
|
458 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 7 | elaprof [numeric] |
|
87 distinct values | 458 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 8 | mathprof [numeric] |
|
93 distinct values | 458 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 9 | totalstudents [numeric] |
|
380 distinct values | 458 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 10 | surveysafety [numeric] |
|
41 distinct values | 457 (99.8%) | 1 (0.2%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 11 | dual_program [numeric] |
|
|
458 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 12 | shsat_prep [numeric] |
|
|
458 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 13 | student_count [numeric] |
|
209 distinct values | 458 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 14 | testers_count [numeric] |
|
121 distinct values | 458 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 15 | offers_count [numeric] |
|
49 distinct values | 458 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 16 | rate_offers [numeric] |
|
98 distinct values | 458 (100.0%) | 0 (0.0%) |
Generated by summarytools 1.0.0 (R version 4.1.3)
2022-05-04
ggplot( head(arrange(schools2021, desc(rate_offers)), n = 25), aes(x= reorder(name, rate_offers), y=rate_offers) ) +
geom_bar(stat = "identity", fill="#0033a1") +
coord_flip() +
theme_bw()
p1 <- ggplot(data=model_data[model_data$year == 2021, ], aes(x=coursepassrate, y=rate_offers)) + geom_point(alpha=0.5) + geom_jitter()
p2 <- ggplot(data=model_data[model_data$year == 2021, ], aes(x=elaprof, y=rate_offers)) + geom_point(alpha=0.5) + geom_jitter()
p3 <- ggplot(data=model_data[model_data$year == 2021, ], aes(x=mathprof, y=rate_offers)) + geom_point(alpha=0.5) + geom_jitter()
p4 <- ggplot(data=model_data[model_data$year == 2021, ], aes(x=surveysafety, y=rate_offers)) + geom_point(alpha=0.5) + geom_jitter()
p5 <- ggplot(data=model_data[model_data$year == 2021, ], aes(x=accessibility, y=rate_offers)) + geom_boxplot() + geom_boxplot(outlier.colour = "red")
multiplot(p1, p2, p3, p4, p5, cols=3)
## Warning: Removed 1 rows containing missing values (geom_point).
## Removed 1 rows containing missing values (geom_point).
# Cor plots
ggcorr(schools2021[, c(5,7,8,9,10,11,12,13,14,15,16)], label = T, hjust= 0.9)
The following creates training and testing partitions for the data
and constructs a decision tree. From the correlation plot, I decided to
use
offers_count ~ testers_count + student_count + elaprof + mathprof + coursepassrate.
set.seed(422)
data_split <- initial_split(model_data, prop = 0.8, strata = district)
schools_train <- training(data_split)
schools_test <- testing(data_split)
model_spec <- decision_tree() %>%
set_mode("regression") %>%
set_engine("rpart")
model_spec
## Decision Tree Model Specification (regression)
##
## Computational engine: rpart
# Train the model
model <- model_spec %>%
fit(offers_count ~ testers_count + student_count + elaprof + mathprof + coursepassrate,
data = schools_train)
# Information about the model
model
## parsnip model object
##
## n= 1260
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 1260 637081.600 8.1722220
## 2) testers_count< 198.5 1195 180419.700 4.4702930
## 4) testers_count< 100.5 1089 42853.570 2.1698810
## 8) mathprof< 64.5 990 11025.070 0.9050505 *
## 9) mathprof>=64.5 99 14406.730 14.8181800 *
## 5) testers_count>=100.5 106 72597.860 28.1037700
## 10) mathprof< 88.5 94 25154.000 21.0000000
## 20) mathprof< 53.5 56 3919.982 12.5178600 *
## 21) mathprof>=53.5 38 11267.500 33.5000000 *
## 11) mathprof>=88.5 12 5542.250 83.7500000 *
## 3) testers_count>=198.5 65 139207.500 76.2307700
## 6) elaprof< 82.5 53 44108.750 60.2830200
## 12) mathprof< 62.5 20 7415.200 33.2000000 *
## 13) mathprof>=62.5 33 13132.970 76.6969700 *
## 7) elaprof>=82.5 12 22084.670 146.6667000 *
# A plot of the model
model$fit %>% rpart.plot(type = 4, roundint=FALSE)
After building the decision tree model based upon the variables, we will take the mean square error.
# Declare Mean Absolute Error function
MAE <- function(actual, predicted){
mean(abs(actual - predicted))
}
# Generate the predictions using the test data
predictions <- predict(model, new_data = schools_test)
# Truncate the preductions
predictions$.pred <- trunc(predictions$.pred)
# Calculate the mean absolute error
mae_decisiontree_model <- MAE(schools_test$offers_count, predictions$.pred)
The mean absolute error of the decision tree is 3.2025316.
This first model looks to build a traditional regression using the same variables as the decision tree.
first_model <- lm(offers_count ~ coursepassrate + elaprof + mathprof + surveysafety, data=schools_train)
summary(first_model)
##
## Call:
## lm(formula = offers_count ~ coursepassrate + elaprof + mathprof +
## surveysafety, data = schools_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -41.759 -8.576 -0.876 4.711 175.620
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.31048 7.22798 1.842 0.0658 .
## coursepassrate -0.01141 0.06435 -0.177 0.8593
## elaprof -0.15438 0.07322 -2.108 0.0352 *
## mathprof 0.70883 0.06593 10.751 < 2e-16 ***
## surveysafety -0.27439 0.06365 -4.311 1.75e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.33 on 1254 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.3386, Adjusted R-squared: 0.3365
## F-statistic: 160.5 on 4 and 1254 DF, p-value: < 2.2e-16
# To see the performance metrics in an orderly data frame
glance(first_model)
# To see the R-squared
wrapFTest(first_model)
## [1] "F Test summary: (R2=0.3386, F(4,1254)=160.5, p<1e-05)."
## Evaluation R-2
firstmodelr_2 <- summary(first_model)$r.squared
## Use the model on the testing data
## Create the prediction column based on the model
schools_test$p1 <- predict(first_model, newdata=schools_test)
## Plot to compare the predictions to actual prediction on the x axis
ggplot(schools_test, aes(x=p1, y=offers_count)) + geom_point() + geom_abline(color="blue")
# Calculate the RMSE
first_rmse <- schools_test %>%
mutate(residual = p1 - offers_count) %>%
summarize(rmse = sqrt(mean(residual^2)))
# Get R-squared from glance and print it
(rsq_glance <- glance(first_model)$r.squared)
## [1] 0.3385638
The first model’s RMSE is
This second model uses quasipoisson for the regression because the
variance is greater than the mean for offers_count. Because
this is a quasipoisson, an AIC score cannot be generated. However, a
RMSE score is created.
# Quasipoisson variance is greater than the mean for offers_count
glm_second_model <- glm(offers_count ~ testers_count + coursepassrate + elaprof + mathprof + surveysafety, data=schools_train, family=quasipoisson)
summary(glm_second_model)
##
## Call:
## glm(formula = offers_count ~ testers_count + coursepassrate +
## elaprof + mathprof + surveysafety, family = quasipoisson,
## data = schools_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -14.6335 -1.6865 -1.0574 -0.5426 11.0786
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.0429990 1.1490305 -4.389 1.23e-05 ***
## testers_count 0.0076113 0.0002519 30.220 < 2e-16 ***
## coursepassrate 0.0622752 0.0122191 5.097 3.99e-07 ***
## elaprof -0.0178286 0.0048805 -3.653 0.00027 ***
## mathprof 0.0573938 0.0046112 12.447 < 2e-16 ***
## surveysafety -0.0212445 0.0040421 -5.256 1.73e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasipoisson family taken to be 7.289936)
##
## Null deviance: 37110.5 on 1258 degrees of freedom
## Residual deviance: 6561.6 on 1253 degrees of freedom
## (1 observation deleted due to missingness)
## AIC: NA
##
## Number of Fisher Scoring iterations: 6
schools_test$p2 <- predict(glm_second_model, newdata=schools_test)
# Calculate the RMSE
second_rmse <- schools_test %>%
mutate(residual = p2 - offers_count) %>%
summarize(rmse = sqrt(mean(residual^2)))
# Plot predictions vs cnt (pred on x-axis)
ggplot(schools_test, aes(x = p2, y = offers_count)) +
geom_point() +
geom_abline(color = "darkblue")
For this third model, I will use a zero-inflated model. Firstly, a test is done to see if one is needed. Secondly, if one is needed, then a third model is created.
# Zero Inflation test for glm_second_model
check_zeroinflation(glm_second_model)
## # Check for zero-inflation
##
## Observed zeros: 934
## Predicted zeros: 408
## Ratio: 0.44
## Model is underfitting zeros (probable zero-inflation).
zip_model <- zeroinfl(offers_count ~ testers_count + coursepassrate + elaprof + mathprof + surveysafety, data = schools_train)
schools_test$p3 <- predict(zip_model, newdata=schools_test)
# Calculate the RMSE
third_rmse <- schools_test %>%
mutate(residual = p2 - offers_count) %>%
summarize(rmse = sqrt(mean(residual^2)))
# Plot predictions vs cnt (pred on x-axis)
ggplot(schools_test, aes(x = p2, y = offers_count)) +
geom_point() +
geom_abline(color = "darkblue")
summary(zip_model)
##
## Call:
## zeroinfl(formula = offers_count ~ testers_count + coursepassrate + elaprof +
## mathprof + surveysafety, data = schools_train)
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -8.39681 -0.22161 -0.07272 -0.03212 8.63169
##
## Count model coefficients (poisson with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 9.720e-01 3.499e-01 2.778 0.005474 **
## testers_count 6.055e-03 9.574e-05 63.240 < 2e-16 ***
## coursepassrate 6.610e-04 3.603e-03 0.183 0.854430
## elaprof -7.009e-03 1.871e-03 -3.747 0.000179 ***
## mathprof 3.404e-02 1.749e-03 19.461 < 2e-16 ***
## surveysafety -3.903e-03 1.600e-03 -2.440 0.014705 *
##
## Zero-inflation model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 8.128631 2.398116 3.390 0.0007 ***
## testers_count -0.052219 0.004471 -11.678 < 2e-16 ***
## coursepassrate -0.030935 0.023298 -1.328 0.1843
## elaprof -0.009621 0.017457 -0.551 0.5816
## mathprof -0.075576 0.016377 -4.615 3.94e-06 ***
## surveysafety 0.029381 0.017472 1.682 0.0926 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Number of iterations in BFGS optimization: 17
## Log-likelihood: -1897 on 12 Df
The third model’s RMSE is
Though the first model has the lowest RMSE, the zero-inflated model has a better AIC measure.
AIC(first_model, glm_second_model, zip_model)
With this and the importance from the summary for the model, the
variables elaprof and mathprof. According to
the data:
elaprof: Percent of students proficient in ELAmathprof: Percent of students proficient in mathBecause of the strong correlations between proficiency in ELA and math, I would recommend schools work to improve these proficiencies to have a better success rate for obtaining offers from Specialized High Schools (including raising awareness of the examinations and encouraging students to take them). Overall, I selected a middle school with these high variables close to my neighborhood.
Additional research can be done to look at demographics and economic data, but overall, to improve proficiencies in ELA and mathematics seem to be a viable way to go for preparing students and middle schools