Project 2-SLK

Author

Senay Leul Kahsay

Published

2025-04-17

Source: Options For Youth [https://ofy.org/blog/everything-need-get-sat-testing-day/]

SAT Scores and Family Income

Introduction

For the second project, I wanted to look at the education sector, and more specifically at the SAT. My data set comes from the National Center for Education Statistics. It consists of 577 observations and 99 variables and reports average SAT scores for each state from 2005 to 2015. It also includes the average GPA of all students in several subjects for a given state and year, as well as the average number of years a student had studied each subject when they took the SAT. Finally, it breaks down the average math and verbal scores by family income bracket. These are the main variables I explore in this project.

I chose this data set because, as someone who took the SAT recently, I want to find out more about which factors contribute to exam scores. I hope the analysis gives some insight into why some students get higher SAT scores than others.

# loading in the data set and libraries 
library(tidyverse)
library(DataExplorer)
library(plotly)
library(GGally)
library(ggfortify)
setwd("C:/Users/senay/OneDrive/Desktop/Scoo/Spring 2025/DATA 110/visual")
satscores <- read_csv("school_scores.csv")

Exploratory Data Analysis

head(satscores)
# A tibble: 6 × 99
   Year State.Code State.Name Total.Math `Total.Test-takers` Total.Verbal
  <dbl> <chr>      <chr>           <dbl>               <dbl>        <dbl>
1  2005 AL         Alabama           559                3985          567
2  2005 AK         Alaska            519                3996          523
3  2005 AZ         Arizona           530               18184          526
4  2005 AR         Arkansas          552                1600          563
5  2005 CA         California        522              186552          504
6  2005 CO         Colorado          560               11990          560
# ℹ 93 more variables: `Academic Subjects.Arts/Music.Average GPA` <dbl>,
#   `Academic Subjects.Arts/Music.Average Years` <dbl>,
#   `Academic Subjects.English.Average GPA` <dbl>,
#   `Academic Subjects.English.Average Years` <dbl>,
#   `Academic Subjects.Foreign Languages.Average GPA` <dbl>,
#   `Academic Subjects.Foreign Languages.Average Years` <dbl>,
#   `Academic Subjects.Mathematics.Average GPA` <dbl>, …
str(satscores)
spc_tbl_ [577 × 99] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ Year                                                   : num [1:577] 2005 2005 2005 2005 2005 ...
 $ State.Code                                             : chr [1:577] "AL" "AK" "AZ" "AR" ...
 $ State.Name                                             : chr [1:577] "Alabama" "Alaska" "Arizona" "Arkansas" ...
 $ Total.Math                                             : num [1:577] 559 519 530 552 522 560 517 502 478 498 ...
 $ Total.Test-takers                                      : num [1:577] 3985 3996 18184 1600 186552 ...
 $ Total.Verbal                                           : num [1:577] 567 523 526 563 504 560 517 503 490 498 ...
 $ Academic Subjects.Arts/Music.Average GPA               : num [1:577] 3.92 3.76 3.85 3.9 3.76 3.88 3.66 3.71 3.54 3.77 ...
 $ Academic Subjects.Arts/Music.Average Years             : num [1:577] 2.2 1.9 2.1 2.2 1.8 2.2 2.1 1.8 1.8 1.8 ...
 $ Academic Subjects.English.Average GPA                  : num [1:577] 3.53 3.35 3.45 3.61 3.32 3.49 3.13 3.21 3.03 3.29 ...
 $ Academic Subjects.English.Average Years                : num [1:577] 3.9 3.9 3.9 4 3.8 4 3.9 3.9 3.8 3.8 ...
 $ Academic Subjects.Foreign Languages.Average GPA        : num [1:577] 3.54 3.34 3.41 3.64 3.29 3.41 3.03 3.18 3.04 3.3 ...
 $ Academic Subjects.Foreign Languages.Average Years      : num [1:577] 2.6 2.1 2.6 2.6 2.8 3.1 3.1 2.7 2.7 2.4 ...
 $ Academic Subjects.Mathematics.Average GPA              : num [1:577] 3.41 3.06 3.25 3.46 3.05 3.33 3 3.07 2.91 3.07 ...
 $ Academic Subjects.Mathematics.Average Years            : num [1:577] 4 3.5 3.9 4.1 3.7 3.9 3.8 3.8 3.7 3.8 ...
 $ Academic Subjects.Natural Sciences.Average GPA         : num [1:577] 3.52 3.25 3.43 3.55 3.2 3.43 3.07 3.19 2.99 3.27 ...
 $ Academic Subjects.Natural Sciences.Average Years       : num [1:577] 3.9 3.2 3.4 3.7 3.2 3.7 3.5 3.6 3.3 3.5 ...
 $ Academic Subjects.Social Sciences/History.Average GPA  : num [1:577] 3.59 3.39 3.55 3.67 3.38 3.56 3.18 3.3 3.11 3.39 ...
 $ Academic Subjects.Social Sciences/History.Average Years: num [1:577] 3.9 3.4 3.3 3.6 3.3 3.7 3.6 3.6 3.4 3.5 ...
 $ Family Income.Between 20-40k.Math                      : num [1:577] 513 492 498 513 477 533 463 449 391 471 ...
 $ Family Income.Between 20-40k.Test-takers               : num [1:577] 324 401 2121 180 26161 ...
 $ Family Income.Between 20-40k.Verbal                    : num [1:577] 527 500 495 526 458 535 467 454 404 473 ...
 $ Family Income.Between 40-60k.Math                      : num [1:577] 539 517 520 543 506 543 493 481 433 492 ...
 $ Family Income.Between 40-60k.Test-takers               : num [1:577] 442 539 2270 245 18347 ...
 $ Family Income.Between 40-60k.Verbal                    : num [1:577] 551 522 518 555 494 548 499 487 454 496 ...
 $ Family Income.Between 60-80k.Math                      : num [1:577] 550 513 524 553 521 553 507 497 470 504 ...
 $ Family Income.Between 60-80k.Test-takers               : num [1:577] 473 603 2372 227 17937 ...
 $ Family Income.Between 60-80k.Verbal                    : num [1:577] 564 519 523 570 511 552 511 501 482 505 ...
 $ Family Income.Between 80-100k.Math                     : num [1:577] 566 528 534 570 535 562 523 512 539 516 ...
 $ Family Income.Between 80-100k.Test-takers              : num [1:577] 475 444 1866 147 14120 ...
 $ Family Income.Between 80-100k.Verbal                   : num [1:577] 577 534 533 580 525 560 523 510 549 517 ...
 $ Family Income.Less than 20k.Math                       : num [1:577] 462 464 485 489 451 514 434 411 374 433 ...
 $ Family Income.Less than 20k.Test-takers                : num [1:577] 175 191 891 107 19323 ...
 $ Family Income.Less than 20k.Verbal                     : num [1:577] 474 467 474 486 421 505 426 410 377 431 ...
 $ Family Income.More than 100k.Math                      : num [1:577] 588 541 554 572 566 574 565 554 608 544 ...
 $ Family Income.More than 100k.Test-takers               : num [1:577] 980 540 3083 314 27984 ...
 $ Family Income.More than 100k.Verbal                    : num [1:577] 590 544 546 589 551 568 559 550 622 538 ...
 $ GPA.A minus.Math                                       : num [1:577] 569 544 541 559 562 573 585 534 566 530 ...
 $ GPA.A minus.Test-takers                                : num [1:577] 724 673 3334 298 30545 ...
 $ GPA.A minus.Verbal                                     : num [1:577] 575 546 535 572 538 570 577 532 574 526 ...
 $ GPA.A plus.Math                                        : num [1:577] 622 600 605 629 625 627 652 593 584 597 ...
 $ GPA.A plus.Test-takers                                 : num [1:577] 563 173 1684 273 7502 ...
 $ GPA.A plus.Verbal                                      : num [1:577] 623 604 593 639 603 614 643 585 578 589 ...
 $ GPA.A.Math                                             : num [1:577] 600 580 571 579 592 602 616 558 559 554 ...
 $ GPA.A.Test-takers                                      : num [1:577] 1032 671 3854 457 25546 ...
 $ GPA.A.Verbal                                           : num [1:577] 608 578 563 583 565 598 606 556 570 550 ...
 $ GPA.B.Math                                             : num [1:577] 514 492 498 492 494 526 506 481 466 474 ...
 $ GPA.B.Test-takers                                      : num [1:577] 1253 1622 7193 437 84659 ...
 $ GPA.B.Verbal                                           : num [1:577] 525 499 499 511 480 529 506 482 479 476 ...
 $ GPA.C.Math                                             : num [1:577] 436 466 458 419 434 484 431 422 370 420 ...
 $ GPA.C.Test-takers                                      : num [1:577] 188 418 1184 57 18839 ...
 $ GPA.C.Verbal                                           : num [1:577] 451 472 464 436 427 489 442 430 386 426 ...
 $ GPA.D or lower.Math                                    : num [1:577] 0 424 439 0 419 457 395 392 323 395 ...
 $ GPA.D or lower.Test-takers                             : num [1:577] 0 12 16 0 240 12 105 19 13 111 ...
 $ GPA.D or lower.Verbal                                  : num [1:577] 0 466 435 0 408 462 407 399 377 408 ...
 $ GPA.No response.Math                                   : num [1:577] 0 0 0 0 0 0 0 0 0 0 ...
 $ GPA.No response.Test-takers                            : num [1:577] 225 427 919 78 19221 ...
 $ GPA.No response.Verbal                                 : num [1:577] 0 0 0 0 0 0 0 0 0 0 ...
 $ Gender.Female.Math                                     : num [1:577] 538 505 513 536 504 546 502 486 451 484 ...
 $ Gender.Female.Test-takers                              : num [1:577] 2072 2161 9806 859 102944 ...
 $ Gender.Female.Verbal                                   : num [1:577] 561 521 522 558 499 558 513 498 475 496 ...
 $ Gender.Male.Math                                       : num [1:577] 582 535 549 570 543 577 534 521 509 516 ...
 $ Gender.Male.Test-takers                                : num [1:577] 1913 1835 8378 741 83608 ...
 $ Gender.Male.Verbal                                     : num [1:577] 574 526 531 570 510 561 520 508 508 502 ...
 $ Score Ranges.Between 200 to 300.Math.Females           : num [1:577] 22 30 119 12 2978 ...
 $ Score Ranges.Between 200 to 300.Math.Males             : num [1:577] 10 20 72 7 1453 ...
 $ Score Ranges.Between 200 to 300.Math.Total             : num [1:577] 32 50 191 19 4431 ...
 $ Score Ranges.Between 200 to 300.Verbal.Females         : num [1:577] 14 26 115 9 3382 ...
 $ Score Ranges.Between 200 to 300.Verbal.Males           : num [1:577] 17 26 86 3 2433 ...
 $ Score Ranges.Between 200 to 300.Verbal.Total           : num [1:577] 31 52 201 12 5815 ...
 $ Score Ranges.Between 300 to 400.Math.Females           : num [1:577] 173 233 881 68 14595 ...
 $ Score Ranges.Between 300 to 400.Math.Males             : num [1:577] 93 153 450 31 7159 ...
 $ Score Ranges.Between 300 to 400.Math.Total             : num [1:577] 266 386 1331 99 21754 ...
 $ Score Ranges.Between 300 to 400.Verbal.Females         : num [1:577] 123 218 739 46 15386 ...
 $ Score Ranges.Between 300 to 400.Verbal.Males           : num [1:577] 84 171 613 42 10784 ...
 $ Score Ranges.Between 300 to 400.Verbal.Total           : num [1:577] 207 389 1352 88 26170 ...
 $ Score Ranges.Between 400 to 500.Math.Females           : num [1:577] 514 696 3215 210 31530 ...
 $ Score Ranges.Between 400 to 500.Math.Males             : num [1:577] 293 485 1948 137 20172 ...
 $ Score Ranges.Between 400 to 500.Math.Total             : num [1:577] 807 1181 5163 347 51702 ...
 $ Score Ranges.Between 400 to 500.Verbal.Females         : num [1:577] 430 656 3048 183 32897 ...
 $ Score Ranges.Between 400 to 500.Verbal.Males           : num [1:577] 332 552 2398 141 25260 ...
 $ Score Ranges.Between 400 to 500.Verbal.Total           : num [1:577] 762 1208 5446 324 58157 ...
 $ Score Ranges.Between 500 to 600.Math.Females           : num [1:577] 722 813 3576 316 30765 ...
 $ Score Ranges.Between 500 to 600.Math.Males             : num [1:577] 614 616 3152 244 26052 ...
 $ Score Ranges.Between 500 to 600.Math.Total             : num [1:577] 1336 1429 6728 560 56817 ...
 $ Score Ranges.Between 500 to 600.Verbal.Females         : num [1:577] 690 729 3661 302 30190 ...
 $ Score Ranges.Between 500 to 600.Verbal.Males           : num [1:577] 617 596 3101 236 25399 ...
 $ Score Ranges.Between 500 to 600.Verbal.Total           : num [1:577] 1307 1325 6762 538 55589 ...
 $ Score Ranges.Between 600 to 700.Math.Females           : num [1:577] 485 342 1688 204 17625 ...
 $ Score Ranges.Between 600 to 700.Math.Males             : num [1:577] 611 445 2126 239 19980 ...
 $ Score Ranges.Between 600 to 700.Math.Total             : num [1:577] 1096 787 3814 443 37605 ...
 $ Score Ranges.Between 600 to 700.Verbal.Females         : num [1:577] 596 423 1831 242 16078 ...
 $ Score Ranges.Between 600 to 700.Verbal.Males           : num [1:577] 613 375 1679 226 14966 ...
 $ Score Ranges.Between 600 to 700.Verbal.Total           : num [1:577] 1209 798 3510 468 31044 ...
 $ Score Ranges.Between 700 to 800.Math.Females           : num [1:577] 156 47 327 49 5451 ...
 $ Score Ranges.Between 700 to 800.Math.Males             : num [1:577] 292 116 630 83 8792 ...
 $ Score Ranges.Between 700 to 800.Math.Total             : num [1:577] 448 163 957 132 14243 ...
 $ Score Ranges.Between 700 to 800.Verbal.Females         : num [1:577] 219 109 412 77 5011 ...
 $ Score Ranges.Between 700 to 800.Verbal.Males           : num [1:577] 250 115 501 93 4766 ...
 $ Score Ranges.Between 700 to 800.Verbal.Total           : num [1:577] 469 224 913 170 9777 ...
 - attr(*, "spec")=
  .. cols(
  ..   Year = col_double(),
  ..   State.Code = col_character(),
  ..   State.Name = col_character(),
  ..   Total.Math = col_double(),
  ..   `Total.Test-takers` = col_double(),
  ..   Total.Verbal = col_double(),
  ..   `Academic Subjects.Arts/Music.Average GPA` = col_double(),
  ..   `Academic Subjects.Arts/Music.Average Years` = col_double(),
  ..   `Academic Subjects.English.Average GPA` = col_double(),
  ..   `Academic Subjects.English.Average Years` = col_double(),
  ..   `Academic Subjects.Foreign Languages.Average GPA` = col_double(),
  ..   `Academic Subjects.Foreign Languages.Average Years` = col_double(),
  ..   `Academic Subjects.Mathematics.Average GPA` = col_double(),
  ..   `Academic Subjects.Mathematics.Average Years` = col_double(),
  ..   `Academic Subjects.Natural Sciences.Average GPA` = col_double(),
  ..   `Academic Subjects.Natural Sciences.Average Years` = col_double(),
  ..   `Academic Subjects.Social Sciences/History.Average GPA` = col_double(),
  ..   `Academic Subjects.Social Sciences/History.Average Years` = col_double(),
  ..   `Family Income.Between 20-40k.Math` = col_double(),
  ..   `Family Income.Between 20-40k.Test-takers` = col_double(),
  ..   `Family Income.Between 20-40k.Verbal` = col_double(),
  ..   `Family Income.Between 40-60k.Math` = col_double(),
  ..   `Family Income.Between 40-60k.Test-takers` = col_double(),
  ..   `Family Income.Between 40-60k.Verbal` = col_double(),
  ..   `Family Income.Between 60-80k.Math` = col_double(),
  ..   `Family Income.Between 60-80k.Test-takers` = col_double(),
  ..   `Family Income.Between 60-80k.Verbal` = col_double(),
  ..   `Family Income.Between 80-100k.Math` = col_double(),
  ..   `Family Income.Between 80-100k.Test-takers` = col_double(),
  ..   `Family Income.Between 80-100k.Verbal` = col_double(),
  ..   `Family Income.Less than 20k.Math` = col_double(),
  ..   `Family Income.Less than 20k.Test-takers` = col_double(),
  ..   `Family Income.Less than 20k.Verbal` = col_double(),
  ..   `Family Income.More than 100k.Math` = col_double(),
  ..   `Family Income.More than 100k.Test-takers` = col_double(),
  ..   `Family Income.More than 100k.Verbal` = col_double(),
  ..   `GPA.A minus.Math` = col_double(),
  ..   `GPA.A minus.Test-takers` = col_double(),
  ..   `GPA.A minus.Verbal` = col_double(),
  ..   `GPA.A plus.Math` = col_double(),
  ..   `GPA.A plus.Test-takers` = col_double(),
  ..   `GPA.A plus.Verbal` = col_double(),
  ..   GPA.A.Math = col_double(),
  ..   `GPA.A.Test-takers` = col_double(),
  ..   GPA.A.Verbal = col_double(),
  ..   GPA.B.Math = col_double(),
  ..   `GPA.B.Test-takers` = col_double(),
  ..   GPA.B.Verbal = col_double(),
  ..   GPA.C.Math = col_double(),
  ..   `GPA.C.Test-takers` = col_double(),
  ..   GPA.C.Verbal = col_double(),
  ..   `GPA.D or lower.Math` = col_double(),
  ..   `GPA.D or lower.Test-takers` = col_double(),
  ..   `GPA.D or lower.Verbal` = col_double(),
  ..   `GPA.No response.Math` = col_double(),
  ..   `GPA.No response.Test-takers` = col_double(),
  ..   `GPA.No response.Verbal` = col_double(),
  ..   Gender.Female.Math = col_double(),
  ..   `Gender.Female.Test-takers` = col_double(),
  ..   Gender.Female.Verbal = col_double(),
  ..   Gender.Male.Math = col_double(),
  ..   `Gender.Male.Test-takers` = col_double(),
  ..   Gender.Male.Verbal = col_double(),
  ..   `Score Ranges.Between 200 to 300.Math.Females` = col_double(),
  ..   `Score Ranges.Between 200 to 300.Math.Males` = col_double(),
  ..   `Score Ranges.Between 200 to 300.Math.Total` = col_double(),
  ..   `Score Ranges.Between 200 to 300.Verbal.Females` = col_double(),
  ..   `Score Ranges.Between 200 to 300.Verbal.Males` = col_double(),
  ..   `Score Ranges.Between 200 to 300.Verbal.Total` = col_double(),
  ..   `Score Ranges.Between 300 to 400.Math.Females` = col_double(),
  ..   `Score Ranges.Between 300 to 400.Math.Males` = col_double(),
  ..   `Score Ranges.Between 300 to 400.Math.Total` = col_double(),
  ..   `Score Ranges.Between 300 to 400.Verbal.Females` = col_double(),
  ..   `Score Ranges.Between 300 to 400.Verbal.Males` = col_double(),
  ..   `Score Ranges.Between 300 to 400.Verbal.Total` = col_double(),
  ..   `Score Ranges.Between 400 to 500.Math.Females` = col_double(),
  ..   `Score Ranges.Between 400 to 500.Math.Males` = col_double(),
  ..   `Score Ranges.Between 400 to 500.Math.Total` = col_double(),
  ..   `Score Ranges.Between 400 to 500.Verbal.Females` = col_double(),
  ..   `Score Ranges.Between 400 to 500.Verbal.Males` = col_double(),
  ..   `Score Ranges.Between 400 to 500.Verbal.Total` = col_double(),
  ..   `Score Ranges.Between 500 to 600.Math.Females` = col_double(),
  ..   `Score Ranges.Between 500 to 600.Math.Males` = col_double(),
  ..   `Score Ranges.Between 500 to 600.Math.Total` = col_double(),
  ..   `Score Ranges.Between 500 to 600.Verbal.Females` = col_double(),
  ..   `Score Ranges.Between 500 to 600.Verbal.Males` = col_double(),
  ..   `Score Ranges.Between 500 to 600.Verbal.Total` = col_double(),
  ..   `Score Ranges.Between 600 to 700.Math.Females` = col_double(),
  ..   `Score Ranges.Between 600 to 700.Math.Males` = col_double(),
  ..   `Score Ranges.Between 600 to 700.Math.Total` = col_double(),
  ..   `Score Ranges.Between 600 to 700.Verbal.Females` = col_double(),
  ..   `Score Ranges.Between 600 to 700.Verbal.Males` = col_double(),
  ..   `Score Ranges.Between 600 to 700.Verbal.Total` = col_double(),
  ..   `Score Ranges.Between 700 to 800.Math.Females` = col_double(),
  ..   `Score Ranges.Between 700 to 800.Math.Males` = col_double(),
  ..   `Score Ranges.Between 700 to 800.Math.Total` = col_double(),
  ..   `Score Ranges.Between 700 to 800.Verbal.Females` = col_double(),
  ..   `Score Ranges.Between 700 to 800.Verbal.Males` = col_double(),
  ..   `Score Ranges.Between 700 to 800.Verbal.Total` = col_double()
  .. )
 - attr(*, "problems")=<externalptr> 

Cleaning the data set

#making the letters lowercase
names(satscores) <- tolower(names(satscores))

#substituting spaces with underscores
names(satscores) <- gsub(" ", "_", names(satscores))

#substituting periods with underscores
names(satscores) <- gsub("[.]", "_", names(satscores))

#replacing slashes with underscores to make working with columns easier
names(satscores) <- gsub("[/]", "_", names(satscores))

#shortening column names
names(satscores) <- gsub("academic_subjects_", "", names(satscores))

#replacing hyphens with underscores
names(satscores) <- gsub("[-]", "_", names(satscores))
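
The renaming steps above can also be folded into one small helper. The following is a minimal sketch, not part of the original analysis: `clean_names_sketch` is a hypothetical name, and the three sample column names are copied from the `str()` output earlier in this report.

```r
# Hypothetical helper (an assumption, not from the original code): applies the
# same renaming rules as the step-by-step gsub() calls, in a single function.
clean_names_sketch <- function(x) {
  x <- tolower(x)                                   # lowercase everything
  x <- gsub("[ ./-]", "_", x)                       # spaces, periods, slashes, hyphens -> "_"
  gsub("academic_subjects_", "", x, fixed = TRUE)   # drop the long prefix
}

# Sample raw column names taken from the str() output above
raw_names <- c(
  "Total.Test-takers",
  "Academic Subjects.Arts/Music.Average GPA",
  "Family Income.Between 20-40k.Math"
)

cleaned <- clean_names_sketch(raw_names)
print(cleaned)
```

With the real data, `names(satscores) <- clean_names_sketch(names(satscores))` should give the same column names as the six separate steps.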

Statistical Analysis

I model the average SAT math score as the response variable, with the average GPA values of the different subjects as the predictors.

# begin by exploring the correlation between the average GPA of different subjects and the SAT test scores
plot_correlation(satscores[c(4,6,7,9,11,13,15,17)])

The average math and verbal scores are very strongly correlated with each other across states and years. Both scores are also strongly correlated with the average GPA in each of the subjects included.

#making a multiple regression model for the average score of the math based on multiple variables(the average GPA of the subjects) 
mathmodel <- lm(total_math ~ arts_music_average_gpa + english_average_gpa + foreign_languages_average_gpa + mathematics_average_gpa + natural_sciences_average_gpa + social_sciences_history_average_gpa, data = satscores)
summary(mathmodel)

Call:
lm(formula = total_math ~ arts_music_average_gpa + english_average_gpa + 
    foreign_languages_average_gpa + mathematics_average_gpa + 
    natural_sciences_average_gpa + social_sciences_history_average_gpa, 
    data = satscores)

Residuals:
     Min       1Q   Median       3Q      Max 
-112.404   -9.495    1.123   16.179   87.040 

Coefficients:
                                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)                          -280.84      77.54  -3.622 0.000319 ***
arts_music_average_gpa                 71.82      35.74   2.009 0.044987 *  
english_average_gpa                  -116.64      35.56  -3.280 0.001100 ** 
foreign_languages_average_gpa         -69.67      23.05  -3.022 0.002620 ** 
mathematics_average_gpa               115.43      24.58   4.696 3.32e-06 ***
natural_sciences_average_gpa           33.04      46.67   0.708 0.479278    
social_sciences_history_average_gpa   197.58      45.51   4.341 1.68e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 25.6 on 570 degrees of freedom
Multiple R-squared:  0.6959,    Adjusted R-squared:  0.6927 
F-statistic: 217.4 on 6 and 570 DF,  p-value: < 2.2e-16
autoplot(mathmodel, 1:4, nrow=2, ncol = 2)

I remove the average GPA for natural sciences because it is not statistically significant (p = 0.479).

mathmodel2 <- lm(total_math ~ arts_music_average_gpa + english_average_gpa + foreign_languages_average_gpa + mathematics_average_gpa + social_sciences_history_average_gpa, data = satscores)
summary(mathmodel2)

Call:
lm(formula = total_math ~ arts_music_average_gpa + english_average_gpa + 
    foreign_languages_average_gpa + mathematics_average_gpa + 
    social_sciences_history_average_gpa, data = satscores)

Residuals:
     Min       1Q   Median       3Q      Max 
-113.376   -9.697    1.194   16.105   88.252 

Coefficients:
                                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)                          -296.18      74.41  -3.980 7.77e-05 ***
arts_music_average_gpa                 76.71      35.05   2.188  0.02905 *  
english_average_gpa                  -112.50      35.06  -3.209  0.00141 ** 
foreign_languages_average_gpa         -66.17      22.50  -2.940  0.00341 ** 
mathematics_average_gpa               126.61      18.82   6.727 4.22e-11 ***
social_sciences_history_average_gpa   210.63      41.60   5.064 5.56e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 25.59 on 571 degrees of freedom
Multiple R-squared:  0.6956,    Adjusted R-squared:  0.6929 
F-statistic:   261 on 5 and 571 DF,  p-value: < 2.2e-16
autoplot(mathmodel2, 1:4, nrow=2, ncol = 2)

  • The equation for this model would be: average math SAT score = 76.71(average art/music GPA) - 112.50(average English GPA) - 66.17(average foreign language GPA) + 126.61(average math GPA) + 210.63(average social science/history GPA) - 296.18

  • The p-value of every explanatory variable in this model is below 0.05, and the p-value of the overall F-test is less than 2.2e-16, which means that this model is statistically significant.

  • The adjusted R-squared value is 0.6929, which means 69.29% of the variation in average math SAT score can be explained by the model.

  • Based on the p-values and the adjusted R-squared value, this model is decent; however, the diagnostic plots show that it has some flaws.
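
As a quick sanity check on the fitted equation, the sketch below plugs the first row of the data (Alabama, 2005) into the rounded coefficients from `summary(mathmodel2)`. The GPA inputs and the observed score of 559 are read off the `str()` output above. The object names are illustrative only, and because the coefficients are rounded the result is approximate.

```r
# Rounded coefficients copied from summary(mathmodel2)
coefs <- c(
  intercept               = -296.18,
  arts_music              =   76.71,
  english                 = -112.50,
  foreign_languages       =  -66.17,
  mathematics             =  126.61,
  social_sciences_history =  210.63
)

# Average GPAs for Alabama in 2005 (first row of the str() output)
alabama_2005 <- c(
  arts_music              = 3.92,
  english                 = 3.53,
  foreign_languages       = 3.54,
  mathematics             = 3.41,
  social_sciences_history = 3.59
)

# Prediction = intercept + sum of (coefficient * GPA), matched by name
predicted <- coefs[["intercept"]] + sum(coefs[names(alabama_2005)] * alabama_2005)

observed <- 559                      # total_math for Alabama, 2005
residual <- observed - predicted

print(round(predicted, 2))
print(round(residual, 2))
```

The prediction comes out at roughly 561, about two points above the observed 559, so this particular row is fitted well even though the residual summary shows much larger misses elsewhere.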

Next I make another model, with the average verbal (reading, not writing) SAT score as the response variable and the average GPA values of the subjects as the predictors.

verbalmodel <- lm(total_verbal ~ arts_music_average_gpa + english_average_gpa + foreign_languages_average_gpa + mathematics_average_gpa + natural_sciences_average_gpa + social_sciences_history_average_gpa, data = satscores)
summary(verbalmodel)

Call:
lm(formula = total_verbal ~ arts_music_average_gpa + english_average_gpa + 
    foreign_languages_average_gpa + mathematics_average_gpa + 
    natural_sciences_average_gpa + social_sciences_history_average_gpa, 
    data = satscores)

Residuals:
    Min      1Q  Median      3Q     Max 
-96.615  -8.957   2.461  13.190  73.718 

Coefficients:
                                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)                          -174.25      66.90  -2.604 0.009443 ** 
arts_music_average_gpa                 33.52      30.84   1.087 0.277613    
english_average_gpa                  -123.91      30.68  -4.039 6.11e-05 ***
foreign_languages_average_gpa         -67.38      19.89  -3.388 0.000754 ***
mathematics_average_gpa                77.41      21.21   3.650 0.000286 ***
natural_sciences_average_gpa          162.14      40.27   4.026 6.44e-05 ***
social_sciences_history_average_gpa   123.07      39.27   3.134 0.001814 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 22.09 on 570 degrees of freedom
Multiple R-squared:  0.7542,    Adjusted R-squared:  0.7516 
F-statistic: 291.5 on 6 and 570 DF,  p-value: < 2.2e-16
autoplot(verbalmodel, 1:4, nrow=2, ncol = 2)

Average art/music GPA is not statistically significant for this model (p = 0.278), so I take it out.

verbalmodel2 <- lm(total_verbal ~  english_average_gpa + foreign_languages_average_gpa + mathematics_average_gpa + natural_sciences_average_gpa + social_sciences_history_average_gpa, data = satscores)
summary(verbalmodel2)

Call:
lm(formula = total_verbal ~ english_average_gpa + foreign_languages_average_gpa + 
    mathematics_average_gpa + natural_sciences_average_gpa + 
    social_sciences_history_average_gpa, data = satscores)

Residuals:
    Min      1Q  Median      3Q     Max 
-96.915  -8.508   2.350  13.505  74.379 

Coefficients:
                                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)                          -105.77      22.51  -4.700 3.27e-06 ***
english_average_gpa                  -129.20      30.30  -4.265 2.34e-05 ***
foreign_languages_average_gpa         -67.03      19.89  -3.370 0.000803 ***
mathematics_average_gpa                71.53      20.51   3.488 0.000524 ***
natural_sciences_average_gpa          170.60      39.52   4.317 1.87e-05 ***
social_sciences_history_average_gpa   142.24      35.09   4.053 5.75e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 22.09 on 571 degrees of freedom
Multiple R-squared:  0.7537,    Adjusted R-squared:  0.7516 
F-statistic: 349.5 on 5 and 571 DF,  p-value: < 2.2e-16
autoplot(verbalmodel2, 1:4, nrow=2, ncol = 2)

  • The equation for this model would be: average verbal SAT score = -129.20(average English GPA) - 67.03(average foreign language GPA) + 71.53(average math GPA) + 170.60(average natural science GPA) + 142.24(average social science/history GPA) - 105.77

  • The p-value of every explanatory variable in this model is below 0.001, and the p-value of the overall F-test is less than 2.2e-16, which means that this model is statistically significant.

  • The adjusted R-squared value is 0.7516, which means 75.16% of the variation in average verbal SAT scores can be explained by this model.

  • Based on the p-values and the adjusted R-squared value, this model is pretty good, but the diagnostic plots show some flaws.
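
The adjusted R-squared values reported by `summary()` can be reproduced from the multiple R-squared with the standard formula 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of observations and p is the number of predictors. A minimal sketch, using n = 577 and the rounded R-squared values printed in the four model summaries (the helper name `adj_r2` is hypothetical):

```r
# Hypothetical helper: adjusted R-squared from R-squared, sample size n,
# and number of predictors p (not counting the intercept)
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

n <- 577  # observations in the data set

# Multiple R-squared values and predictor counts copied from the summaries
models <- data.frame(
  model = c("mathmodel", "mathmodel2", "verbalmodel", "verbalmodel2"),
  r2    = c(0.6959, 0.6956, 0.7542, 0.7537),
  p     = c(6, 5, 6, 5)
)

models$adj_r2 <- round(adj_r2(models$r2, n, models$p), 4)
print(models)
```

The results agree with the printed adjusted R-squared values to within the rounding of the inputs. They also show why dropping a non-significant predictor barely changes the fit: the small loss in R-squared is offset by the smaller penalty for having one fewer predictor.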

Data Analysis and Visualization

First, I want to make a simple plot of the SAT scores for the DMV (the District of Columbia, Maryland, and Virginia).

#filtering for the three states 
dmvsatscores <- satscores |> 
  filter(state_name %in% c("District Of Columbia", "Maryland", "Virginia")) |>
  select(year, state_code,state_name, total_math, total_verbal)
#making the subset data set into a longer format 
dmvsatscores2 <- dmvsatscores |>
  pivot_longer(
    cols = 4:5,
    names_to = "exam_portion",
    values_to = "score"
  )
#plotting for the SAT score for the DMV
p1 <- dmvsatscores2 |> 
  ggplot(aes(x = year, y = score, color = state_name)) + 
  geom_line() + 
  geom_point() + 
  facet_wrap(~exam_portion) +
  scale_color_brewer(palette = "Set2") + 
  theme_bw() + 
  labs(x = "Years",
       y = "Average SAT scores", 
       title = "Average Math and Verbal SAT scores of The DMV From 2005 to 2015",
       caption = "Source: National Center for Education Statistics")
ggplotly(p1)

Final Visualization

Now I want to make my final plot to compare SAT scores based on household income.

#selecting the relevant columns 
householdsatscores <- satscores |> 
  select(year,state_code, state_name, total_math, total_verbal, c(19,21,22,24,25,27,28,30,31,33,34,36) )
#pivoting into longer format 
longhousesat <- householdsatscores |>
  pivot_longer(cols = 6:17,
               names_to = "family_income",
               values_to = "score")
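
Selecting columns by position, as above, depends on the column order never changing. As a possible alternative, the sketch below selects the same kind of columns by name pattern with `matches()`. The one-row tibble is a stand-in built from the Alabama 2005 values in the `str()` output, with column names as they look after the cleaning step. It is an illustration, not the code used for the plot.

```r
library(dplyr)
library(tibble)

# Stand-in for satscores: one row, a handful of cleaned column names
toy_scores <- tibble(
  year = 2005,
  total_math = 559,
  family_income_less_than_20k_math = 462,
  family_income_less_than_20k_test_takers = 175,
  family_income_less_than_20k_verbal = 474
)

# Keep the math and verbal score columns for each income bracket,
# and leave out the test-taker counts
income_scores <- toy_scores |>
  select(year, total_math, matches("^family_income_.*_(math|verbal)$"))

print(names(income_scores))
```

The same `matches()` pattern could be passed to `cols =` in `pivot_longer()`, so the pivot would not rely on the positions 6:17 either.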

I am going to separate the portion of the SAT exam from the family_income column using code from the GIS assignment. First, I put a comma after the letter “k” to make splitting easier, just like in the previous assignment with the coordinates. Then I split the column into two columns right at the comma.
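
To see what the comma trick does to individual values, here is a minimal sketch on a two-row stand-in table. The two column names follow the cleaned naming scheme and the scores are the Alabama 2005 values from the `str()` output; `toy_long` and `toy_split` are illustrative names.

```r
library(dplyr)
library(tidyr)
library(stringr)
library(tibble)

# Stand-in for the pivoted data: the name column still mixes bracket and portion
toy_long <- tibble(
  family_income = c("family_income_less_than_20k_math",
                    "family_income_between_20_40k_verbal"),
  score = c(462, 527)
)

toy_split <- toy_long |>
  # "k_" only occurs where the income bracket ends, so it marks the split point
  mutate(family_income = str_replace_all(family_income, "k_", "k,")) |>
  # split at the comma into the bracket and the exam portion
  separate(family_income, into = c("family_income", "sat_portion"), sep = ",")

print(toy_split)
```

This works because every bracket label ends in "k" and no other part of the name contains "k_".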

#separating household income and portion of the SAT test
longhousesat <- longhousesat |>
  mutate(family_income = str_replace_all(family_income, "k_", "k,")) |>
  separate(family_income, into = c("family_income", "sat_portion"), sep = ",", convert = TRUE )
#plotting 
p2 <- longhousesat |>
  ggplot(aes(x = year, y = score, fill = family_income)) + 
  geom_bar(stat = "identity", position = "dodge")+ 
  facet_wrap(~sat_portion)+ 
  scale_fill_brewer(palette = "YlOrBr") + 
  theme_dark() + 
  labs(x = "Year", 
       y = "Average SAT score", 
       title = "Average SAT Score of Different Family Income Brackets", 
       caption = "Source: National Center for Education Statistics", 
       fill = "Family Income Bracket")
ggplotly(p2)
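
One thing worth checking in a plot like this: ggplot2 orders a character fill variable alphabetically, which puts the "less than 20k" bracket after the "between" brackets in the legend and in the sequential palette. A minimal sketch of putting the brackets in income order with `factor()`, assuming the bracket labels produced by the split step (the names `income_order` and `toy_income` are hypothetical, and the scores are Alabama 2005 math values from the `str()` output):

```r
# Income brackets from lowest to highest (labels as produced by the split step)
income_order <- c(
  "family_income_less_than_20k",
  "family_income_between_20_40k",
  "family_income_between_40_60k",
  "family_income_between_60_80k",
  "family_income_between_80_100k",
  "family_income_more_than_100k"
)

# Small stand-in for longhousesat
toy_income <- data.frame(
  family_income = c("family_income_more_than_100k",
                    "family_income_between_20_40k",
                    "family_income_less_than_20k"),
  score = c(588, 513, 462)
)

# Default ordering: alphabetical, so "between_..." sorts before "less_than_20k"
alphabetical <- levels(factor(toy_income$family_income))

# Explicit ordering: levels follow income from lowest to highest
toy_income$family_income <- factor(toy_income$family_income, levels = income_order)

print(alphabetical)
print(levels(toy_income$family_income))
```

With the real data, the same `factor()` call could go in a `mutate()` just before `ggplot()`, so the lightest color corresponds to the lowest bracket.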

Summary

My final visualization compares the average SAT scores of different family income brackets for both the math and verbal portions of the exam. Although the visualization is crowded at first glance, its interactivity makes it much easier to read: clicking on the legend lets us select two or three family income brackets and compare them separately. The interesting pattern is that students from higher-income families almost always have higher average SAT scores. One possible explanation is that higher-income families have many more resources at their disposal than lower-income families.