In this project, I will analyze data from the New York City school department to understand whether parent, teacher, and student perceptions of the following factors affect average school SAT scores (an indicator of academic performance).

  1. Safety
  2. Engagement
  3. Communication
  4. Academics

In this project, I’ll investigate the following questions:

  1. Do student, teacher, and parent perceptions of NYC school quality appear to be related to demographic and academic success metrics?

  2. Do students, teachers, and parents have similar perceptions of NYC school quality?

I’ll start by loading the tidyverse package. The tidyverse contains several of the packages needed for this analysis. These packages include readr, dplyr, tidyr, ggplot2, purrr, and stringr.

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
## ✓ tibble  3.0.3     ✓ dplyr   1.0.2
## ✓ tidyr   1.1.1     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

I am importing the data using read_csv and read_tsv from the readr package.

combined <- read_csv("combined.csv")
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   DBN = col_character(),
##   school_name = col_character(),
##   boro = col_character()
## )
## See spec(...) for full column specifications.
survey <- read_tsv("masterfile11_gened_final.txt")
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   dbn = col_character(),
##   bn = col_character(),
##   schoolname = col_character(),
##   studentssurveyed = col_character(),
##   schooltype = col_character(),
##   p_q1 = col_logical(),
##   p_q3d = col_logical(),
##   p_q9 = col_logical(),
##   p_q10 = col_logical(),
##   p_q12aa = col_logical(),
##   p_q12ab = col_logical(),
##   p_q12ac = col_logical(),
##   p_q12ad = col_logical(),
##   p_q12ba = col_logical(),
##   p_q12bb = col_logical(),
##   p_q12bc = col_logical(),
##   p_q12bd = col_logical(),
##   t_q6m = col_logical(),
##   t_q9 = col_logical(),
##   t_q10a = col_logical()
##   # ... with 18 more columns
## )
## See spec(...) for full column specifications.
survey_d75 <- read_tsv("masterfile11_d75_final.txt")
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   dbn = col_character(),
##   bn = col_character(),
##   schoolname = col_character(),
##   studentssurveyed = col_character(),
##   schooltype = col_character(),
##   p_q5 = col_logical(),
##   p_q9 = col_logical(),
##   p_q13a = col_logical(),
##   p_q13b = col_logical(),
##   p_q13c = col_logical(),
##   p_q13d = col_logical(),
##   p_q14a = col_logical(),
##   p_q14b = col_logical(),
##   p_q14c = col_logical(),
##   p_q14d = col_logical(),
##   t_q11a = col_logical(),
##   t_q11b = col_logical(),
##   t_q14 = col_logical(),
##   t_q15a = col_logical(),
##   t_q15b = col_logical()
##   # ... with 14 more columns
## )
## See spec(...) for full column specifications.

Simplified the survey dataframe to include only the variables needed for analysis.

survey_select <- survey %>%
  filter(schooltype == "High School") %>%
  select(dbn:aca_tot_11)

Simplified the survey_d75 survey dataframe to include only the variables needed for analysis.

survey_d75_select <- survey_d75 %>%
  select(dbn:aca_tot_11)

Combined the survey_select and survey_d75_select dataframes.

survey_combo <- survey_select %>%
  bind_rows(survey_d75_select)

Rename survey_combo variable ‘dbn’ to ‘DBN’ so can use as a key to join with the ‘combined’ data frame.

survey_combo <- survey_combo %>%
  rename(DBN = dbn)

Joined the new survey dataframe I created to the combined dataframe.

combined1 <- combined %>%
  left_join(survey_combo, by = "DBN")

Created a correlation matrix to look for interesting relationships between pairs of variables in the combined1 dataframe I created.

cor_mat <- combined1 %>%
  select(avg_sat_score, saf_p_11:aca_tot_11) %>%
  cor(use = "pairwise.complete.obs")

Converted a correlation matrix to a tibble to make it easier to work with.

cor_tib <- cor_mat %>%
  as_tibble(rownames = "variable")

Looked for correlations of other variables with ‘avg_sat_score’ that are greater than 0.25 or less than -0.25.

strong_cors <- cor_tib %>%
  select(variable, avg_sat_score) %>%
  filter(avg_sat_score > 0.25 | avg_sat_score < -0.25)

Made scatter plots of those variables with avg_sat_score to examine relationships more closely.

create_scatter <- function(x,y) {
  ggplot(data = combined1) +
    aes_string(x = x, y = y) +
    geom_point(alpha = 0.3) +
    theme(panel.background = element_rect(fill="white"))
}

x_var <- strong_cors$variable[2:5]
y_var <- "avg_sat_score"

map2(x_var, y_var, create_scatter)
## [[1]]
## Warning: Removed 137 rows containing missing values (geom_point).

## 
## [[2]]
## Warning: Removed 139 rows containing missing values (geom_point).

## 
## [[3]]
## Warning: Removed 139 rows containing missing values (geom_point).

## 
## [[4]]
## Warning: Removed 137 rows containing missing values (geom_point).

Used the tidyr function pivot_longer() to reshape your dataframe so that the survey question and the score are in two separate columns.

survey_combo_longer <- survey_combo %>%
  pivot_longer(cols = saf_p_11:aca_tot_11,
  names_to = "survey_question",
  values_to = "score")

Created new variables, using str_sub() to extract information from values of the survey question variable.

survey_combo_longer <- survey_combo_longer %>%
  mutate(response_type = str_sub(survey_question, 4, 6)) %>%
  mutate(metric = str_sub(survey_question, 1, 3))

Replaced ‘response type’ variable name with more intuitive variable names.

survey_combo_longer <- survey_combo_longer %>%
  mutate(response_type = ifelse(response_type == "_p_", "parent",
                          ifelse(response_type =="_t_","teacher",
                           ifelse(response_type =="_s_","student",     
                                  ifelse(response_type == "_to_", "total", "NA")))))

Explored differences in parent, teacher, and student perceptions of NYC school quality metrics using a box plot.

survey_combo_longer %>%
  filter(response_type != "total") %>%
  ggplot(aes(x = metric, y = score, fill = response_type)) +
  geom_boxplot()
## Warning: Removed 16 rows containing non-finite values (stat_boxplot).