Introduction

The high school data I am analyzing was compiled and published by the New York City Department of Education, with SAT score averages and testing rates provided by the College Board. This data is crucial for addressing my research question: “Is there a relationship between average sat scores in maths,reading,writing,school name and the city

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(treemap)
library(RColorBrewer)
library(ggfortify)
library(plotly)
## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout
library(janitor)
## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

Calling the data set for score

setwd("C:/Users/eyong/OneDrive - montgomerycollege.edu/Desktop/data110/New folder")
df<-read_csv("scores.csv")
## Rows: 435 Columns: 22
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (13): School ID, School Name, Borough, Building Code, Street Address, C...
## dbl   (7): Zip Code, Latitude, Longitude, Student Enrollment, Average Score ...
## time  (2): Start Time, End Time
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Examining the data frame

summary(df)
##   School ID         School Name          Borough          Building Code     
##  Length:435         Length:435         Length:435         Length:435        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  Street Address         City              State              Zip Code    
##  Length:435         Length:435         Length:435         Min.   :10001  
##  Class :character   Class :character   Class :character   1st Qu.:10302  
##  Mode  :character   Mode  :character   Mode  :character   Median :10472  
##                                                           Mean   :10726  
##                                                           3rd Qu.:11217  
##                                                           Max.   :11694  
##                                                                          
##     Latitude       Longitude      Phone Number        Start Time      
##  Min.   :40.53   Min.   :-74.19   Length:435         Length:435       
##  1st Qu.:40.69   1st Qu.:-73.98   Class :character   Class1:hms       
##  Median :40.74   Median :-73.93   Mode  :character   Class2:difftime  
##  Mean   :40.75   Mean   :-73.92                      Mode  :numeric   
##  3rd Qu.:40.82   3rd Qu.:-73.88                                       
##  Max.   :40.89   Max.   :-73.73                                       
##                                                                       
##    End Time        Student Enrollment Percent White      Percent Black     
##  Length:435        Min.   :  62.0     Length:435         Length:435        
##  Class1:hms        1st Qu.: 357.2     Class :character   Class :character  
##  Class2:difftime   Median : 466.0     Mode  :character   Mode  :character  
##  Mode  :numeric    Mean   : 702.4                                          
##                    3rd Qu.: 623.2                                          
##                    Max.   :5447.0                                          
##                    NA's   :7                                               
##  Percent Hispanic   Percent Asian      Average Score (SAT Math)
##  Length:435         Length:435         Min.   :317.0           
##  Class :character   Class :character   1st Qu.:386.0           
##  Mode  :character   Mode  :character   Median :415.0           
##                                        Mean   :432.9           
##                                        3rd Qu.:458.5           
##                                        Max.   :754.0           
##                                        NA's   :60              
##  Average Score (SAT Reading) Average Score (SAT Writing) Percent Tested    
##  Min.   :302.0               Min.   :284.0               Length:435        
##  1st Qu.:386.0               1st Qu.:382.0               Class :character  
##  Median :413.0               Median :403.0               Mode  :character  
##  Mean   :424.5               Mean   :418.5                                 
##  3rd Qu.:445.0               3rd Qu.:437.5                                 
##  Max.   :697.0               Max.   :693.0                                 
##  NA's   :60                  NA's   :60

cleaning colums names

gf<-janitor::clean_names(df)

cleaning all NA’s for the data chosen

cleaned_data <- gf %>%
  filter(if_all(c("average_score_sat_math","average_score_sat_reading","percent_tested","city","school_name"), ~ !is.na(.)))
print(cleaned_data)
## # A tibble: 375 × 22
##    school_id school_name        borough building_code street_address city  state
##    <chr>     <chr>              <chr>   <chr>         <chr>          <chr> <chr>
##  1 01M539    New Explorations … Manhat… M022          111 Columbia … Manh… NY   
##  2 02M294    Essex Street Acad… Manhat… M445          350 Grand Str… Manh… NY   
##  3 02M308    Lower Manhattan A… Manhat… M445          350 Grand Str… Manh… NY   
##  4 02M545    High School for D… Manhat… M445          350 Grand Str… Manh… NY   
##  5 01M292    Henry Street Scho… Manhat… M056          220 Henry Str… Manh… NY   
##  6 01M696    Bard High School … Manhat… M097          525 East Hous… Manh… NY   
##  7 02M305    Urban Assembly Ac… Manhat… M445          350 Grand Str… Manh… NY   
##  8 01M509    Marta Valle High … Manhat… M025          145 Stanton S… Manh… NY   
##  9 01M448    University Neighb… Manhat… M446          200 Monroe St… Manh… NY   
## 10 02M543    New Design High S… Manhat… M445          350 Grand Str… Manh… NY   
## # ℹ 365 more rows
## # ℹ 15 more variables: zip_code <dbl>, latitude <dbl>, longitude <dbl>,
## #   phone_number <chr>, start_time <time>, end_time <time>,
## #   student_enrollment <dbl>, percent_white <chr>, percent_black <chr>,
## #   percent_hispanic <chr>, percent_asian <chr>, average_score_sat_math <dbl>,
## #   average_score_sat_reading <dbl>, average_score_sat_writing <dbl>,
## #   percent_tested <chr>

choosing the columns

score <- cleaned_data[, c("average_score_sat_math","average_score_sat_reading","city","average_score_sat_writing")]

Finding correlation between average SAT in maths and reading

correlation <- cor(score$average_score_sat_math, score$average_score_sat_reading)
print(paste("Correlation between average SAT Math and Reading scores:", round(correlation, 4)))
## [1] "Correlation between average SAT Math and Reading scores: 0.9282"

plotting correlation

ggplot(score, aes(x = average_score_sat_math, y = average_score_sat_reading)) +
  geom_point(aes(color = city), alpha = 0.7) +  
  geom_smooth(method = "lm", se = FALSE, color = "blue") +  
  labs(title = "Correlation between Average SAT Math and Reading Scores",
       x = "Average SAT Math Score",
       y = "Average SAT Reading Score") +
  theme_minimal() +
  theme(legend.position = "right")  
## `geom_smooth()` using formula = 'y ~ x'

The correlation value of 0.9282 indicates a strong relationship between the average SAT math and reading scores, as it is very close to one. This suggests that as scores in one subject increase, scores in the other tend to increase as well, highlighting the interdependence of these two academic areas.

Calculating the average mean for SAT scores per city

score_long <- pivot_longer(score, 
                           cols = c("average_score_sat_math", "average_score_sat_reading","average_score_sat_writing"), 
                           names_to = "Score_Type", 
                           values_to = "Score_Value")

Heat map

library(RColorBrewer)

ggplot(score_long, aes(x = city, y = Score_Type, fill = Score_Value)) +
  geom_tile(color = "white") +  
  scale_fill_gradientn(colors = brewer.pal(11, "RdYlBu")) +  # Use RColorBrewer
  labs(title = "Heatmap of SAT Scores by City",
       x = "City",
       y = "Score Type",
       fill = "Score Value") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

The heatmap illustrates that Manhattan has high SAT scores in math, writing, and reading, suggesting that this city may be an ideal choice for a child aspiring to succeed in school.

Regretion analysis

score$city <- as.factor(score$city)

model <- lm(average_score_sat_writing ~ average_score_sat_math + average_score_sat_reading, data = score)

summary(model)
## 
## Call:
## lm(formula = average_score_sat_writing ~ average_score_sat_math + 
##     average_score_sat_reading, data = score)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -47.108  -6.571   0.207   6.614  31.714 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               -14.71851    3.78848  -3.885 0.000121 ***
## average_score_sat_math      0.12598    0.02023   6.227 1.29e-09 ***
## average_score_sat_reading   0.89194    0.02352  37.915  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.47 on 372 degrees of freedom
## Multiple R-squared:  0.9738, Adjusted R-squared:  0.9737 
## F-statistic:  6918 on 2 and 372 DF,  p-value: < 2.2e-16

This regression analysis seeks to determine whether the average SAT writing score is influenced by the average SAT scores in math and reading. The R-squared value is 0.9737, indicating a strong correlation closer to 1, which suggests a high significance in the relationship. Therefore, we can conclude that writing scores are indeed affected by both math and reading scores.

Diagnostic plots

par(mfrow=c(2,2))
plot(model)

Conclusion

The correlation value of 0.9282 indicates a strong relationship between the average SAT math and reading scores, suggesting that as scores in one subject increase, scores in the other tend to rise as well, highlighting the interdependence of these academic areas. The heatmap reveals that Manhattan shows high SAT scores in math, writing, and reading, indicating it may be an ideal city for a child aspiring to succeed in school. Additionally, the regression analysis aims to determine whether the average SAT writing score is influenced by the average SAT scores in math and reading. With an R-squared value of 0.9737, this indicates a strong correlation, suggesting that writing scores are significantly affected by both math and reading scores.

Concerns

In my analysis, I used the janitor library to clean the dataset, which involved fixing column names that contained spaces and capital letters. This process was necessary to make the dataset more manageable, as it was a bit of a headache initially. Once I resolved this issue, cleaning the data by removing NA values from relevant columns was straightforward.

Refrences

published by the New York City Department of Education

this is the link to kaggle Average SAT Scores for NYC Public Schools (kaggle.com)