In this project, I will analyze data from the New York City school department to understand whether parent, teacher, and student perceptions of the following factors affect average school SAT scores (an indicator of academic performance):

  1. Safety
  2. Engagement
  3. Communication
  4. Academics

The data, collected in 2011, are publicly available and can be accessed here.

https://data.cityofnewyork.us/Education/2011-NYC-School-Survey/mnz3-dyi8

https://data.world/dataquest/nyc-schools-data/workspace/file?filename=combined.csv

https://data.cityofnewyork.us/Education/2011-NYC-School-Survey/mnz3-dyi8

I’ll start by loading the packages that I’ll need for this analysis:

library(readr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(stringr)
library(purrr)
library(tidyr)
library(ggplot2)

Then we can import our data into R as dataframes:

combined <- read_csv("combined.csv")
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   DBN = col_character(),
##   school_name = col_character(),
##   boro = col_character()
## )
## See spec(...) for full column specifications.
survey_general_ed <- read_tsv("masterfile11_gened_final.txt")
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   dbn = col_character(),
##   bn = col_character(),
##   schoolname = col_character(),
##   studentssurveyed = col_character(),
##   schooltype = col_character(),
##   p_q1 = col_logical(),
##   p_q3d = col_logical(),
##   p_q9 = col_logical(),
##   p_q10 = col_logical(),
##   p_q12aa = col_logical(),
##   p_q12ab = col_logical(),
##   p_q12ac = col_logical(),
##   p_q12ad = col_logical(),
##   p_q12ba = col_logical(),
##   p_q12bb = col_logical(),
##   p_q12bc = col_logical(),
##   p_q12bd = col_logical(),
##   t_q6m = col_logical(),
##   t_q9 = col_logical(),
##   t_q10a = col_logical()
##   # ... with 18 more columns
## )
## See spec(...) for full column specifications.
survey_d75 <- read_tsv("masterfile11_d75_final.txt")
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   dbn = col_character(),
##   bn = col_character(),
##   schoolname = col_character(),
##   studentssurveyed = col_character(),
##   schooltype = col_character(),
##   p_q5 = col_logical(),
##   p_q9 = col_logical(),
##   p_q13a = col_logical(),
##   p_q13b = col_logical(),
##   p_q13c = col_logical(),
##   p_q13d = col_logical(),
##   p_q14a = col_logical(),
##   p_q14b = col_logical(),
##   p_q14c = col_logical(),
##   p_q14d = col_logical(),
##   t_q11a = col_logical(),
##   t_q11b = col_logical(),
##   t_q14 = col_logical(),
##   t_q15a = col_logical(),
##   t_q15b = col_logical()
##   # ... with 14 more columns
## )
## See spec(...) for full column specifications.

Filter the survey dataframes to include only variables we will need: Filter survey of general ed to include only observations for high schools

survey_select <- survey_general_ed %>%
  filter(schooltype == "High School") %>%
  select(dbn:aca_tot_11)

Select only the columns needed for analysis from survey_d75.

survey_d75_select <- survey_d75 %>%       
  select(dbn:aca_tot_11)

Combine the general education and District 75 survey dataframes using the dplyr function bind_rows()

surveysel_d75_gened <- survey_select %>%
  bind_rows(survey_d75_select)

Join surveysel_d75_gened you created to the combined dataframe Data in DBN matches dbn; this is our key Rename ‘dbn’ column of surveysel_d75_gened in order to join combined

surveysel_d75_gened <- surveysel_d75_gened %>%
  rename(DBN = dbn) # New name = original

Using left_join() to join will retain only observations in the survey dataframe that correspond to observations in combined.

combined_survey <- combined %>%
  left_join(surveysel_d75_gened, by = "DBN")

Do student, teacher, and parent perceptions of NYC school quality appear to be related to demographic and academic success metrics?

First create correlation_matrix in combined_survey to examine the data when coupling variables for relationships.

Convert the correlation matrix into correlation_tibble to identify pairs of variables with strong/weak correlation coefficients. Then select variables, and filter observations.

Finally, create scatter plots to examine potentially interesting correlations more closely.

correlation_matrix <- combined_survey %>%
    select(avg_sat_score, saf_p_11:aca_tot_11) %>%
  cor(use = "pairwise.complete.obs")

correlation_tibble <- correlation_matrix %>% 
  as_tibble(rownames = "variable")

View(correlation_tibble)

Look for correlations of other variables with avg_sat_score that are greater than 0.25 or less than -0.25 (strong correlations).

strong_correlations <- correlation_tibble %>%
  select(variable, avg_sat_score) %>%
  filter(avg_sat_score > 0.25 | avg_sat_score < -0.25)  

Finally, create scatter plots of those variables with avg_sat_scoreto examine potentially interesting correlations more closely.

create_scatter <- function(x, y) {     
  ggplot(data = combined_survey) + 
    aes_string(x = x, y = y) +
    geom_point(alpha = 0.3) +
    theme(panel.background = element_rect(fill = "white"))
}

x_var <- strong_correlations$variable[2:5]
y_var <- "avg_sat_score"
  
map2(x_var, y_var, create_scatter)
## [[1]]
## Warning: Removed 137 rows containing missing values (geom_point).

## 
## [[2]]
## Warning: Removed 139 rows containing missing values (geom_point).

## 
## [[3]]
## Warning: Removed 139 rows containing missing values (geom_point).

## 
## [[4]]
## Warning: Removed 137 rows containing missing values (geom_point).

Do students, teachers, and parents have similar perceptions of NYC school quality?

Group the data by question and survey response type (parent, student, teacher, or total) and then calculate a summary average for each group.

Since visualizations of data are often easier to interpret, perhaps you’ll use box plots to examine score differences.

First reshape the data and create some new variables.

Use the tidyr function pivot_longer() to reshape your dataframe so that the survey question (like saf_p_11 and com_s_11) and the score (like 8.7 or 9.0) are in two separate columns.

combined_survey_gather <- combined_survey %>%
  pivot_longer(cols = saf_p_11:aca_tot_11,
               names_to = "survey_question",
               values_to = "score")

Use str_sub() to create new variables, response_type and question, from the survey_question variable.

combined_survey_gather <- combined_survey_gather %>%
  mutate(response_type = str_sub(survey_question, 4, 6)) %>%   
  mutate(question = str_sub(survey_question, 1, 3))

Use if_else() to assign more intuitive value names to the variables you created using str_sub()

# combined_survey_gather %>%
#  mutate(response_type = ifelse(response_type == "_p_", "parent", 
#                             ifelse(response_type == "_t_", "teacher","NA"))

combined_survey_gather <- combined_survey_gather %>%
  mutate(response_type = ifelse(response_type  == "_p_", "parent", 
                                ifelse(response_type == "_t_", "teacher",
                                       ifelse(response_type == "_s_", "student", 
                                              ifelse(response_type == "_to", "total", "NA")))))

Make a boxplot to see if there appear to be differences in how the three groups of responders (parents, students, and teachers) answered the four questions.

combined_survey_gather %>%
  filter(response_type != "total") %>%
  ggplot() +
  aes(x = question, y = score, fill = response_type) +
  geom_boxplot()
## Warning: Removed 1268 rows containing non-finite values (stat_boxplot).