Graduation Rate

Author

Jonnathan Zuna Largo

Assignment 1 - Graduation Rates

I will create a dedicated GitHub repository, Graduation Rates, to store, version, and manage the data and analysis for this project. A central data/ folder will hold the CSV file and leave room for future datasets. Keeping the project in its own repository keeps it clear, reproducible, and separate from other work.

Data Source

The data comes from Kaggle. Source: https://www.kaggle.com/datasets/rkiattisak/graduation-rate

Data Description

The dataset contains 1,000 observations and is randomly generated. As a result, it does not include any real personal information and is safe for exploration and analysis.

Approach

I will explore relationships between standardized test scores (ACT and SAT) and graduation outcomes. The analysis will also look for trends and patterns across parental income, parental education level, and student GPA.
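As a first look at how test scores relate to a graduation outcome, a correlation matrix is one possible starting point. The sketch below is only illustrative: it uses the ten rows printed by head(df, 10) in this document rather than the full dataset, and the name first_rows is a placeholder introduced here.

```r
# Ten rows copied from the head(df, 10) output in this document.
# first_rows is an illustrative name, not part of the original analysis.
first_rows <- data.frame(
  ACT.composite.score = c(22, 29, 30, 33, 29, 28, 29, 30, 27, 32),
  SAT.total.score     = c(1625, 2090, 2188, 2151, 2050, 1976, 2097, 1976, 2072, 2246),
  years.to.graduate   = c(7, 5, 3, 5, 6, 4, 6, 5, 4, 3)
)

# Pairwise Pearson correlations between the two scores and years to graduate
score_cor <- cor(first_rows)
print(round(score_cor, 2))
```

On the full data, the same call could be applied to the numeric columns of df; ten rows are far too few to draw any conclusion.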

Motivation

This dataset was selected to practice analyzing educational outcomes and to better understand how academic performance and socioeconomic factors may interact in student success metrics.

library(tidyverse)
Warning: package 'ggplot2' was built under R version 4.5.2
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   4.0.1     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
url <- "https://raw.githubusercontent.com/JZunaRepo/data/refs/heads/main/graduation_rate.csv"

df <- read.csv(
  file = url
)

glimpse(df)
Rows: 1,000
Columns: 7
$ ACT.composite.score         <int> 22, 29, 30, 33, 29, 28, 29, 30, 27, 32, 28…
$ SAT.total.score             <int> 1625, 2090, 2188, 2151, 2050, 1976, 2097, …
$ parental.level.of.education <chr> "high school", "associate's degree", "bach…
$ parental.income             <int> 40999, 75817, 82888, 93518, 79153, 100048,…
$ high.school.gpa             <dbl> 3.0, 4.0, 4.0, 4.0, 4.0, 3.8, 4.0, 3.7, 3.…
$ college.gpa                 <dbl> 3.1, 3.4, 3.9, 3.7, 3.4, 3.5, 3.4, 3.4, 3.…
$ years.to.graduate           <int> 7, 5, 3, 5, 6, 4, 6, 5, 4, 3, 5, 5, 5, 6, …

Exploratory Data Analysis

First 10 rows of the data set

head(df,10)
   ACT.composite.score SAT.total.score parental.level.of.education
1                   22            1625                 high school
2                   29            2090          associate's degree
3                   30            2188           bachelor's degree
4                   33            2151          associate's degree
5                   29            2050          associate's degree
6                   28            1976             master's degree
7                   29            2097                some college
8                   30            1976                some college
9                   27            2072           bachelor's degree
10                  32            2246           bachelor's degree
   parental.income high.school.gpa college.gpa years.to.graduate
1            40999             3.0         3.1                 7
2            75817             4.0         3.4                 5
3            82888             4.0         3.9                 3
4            93518             4.0         3.7                 5
5            79153             4.0         3.4                 6
6           100048             3.8         3.5                 4
7            46883             4.0         3.4                 6
8            67379             3.7         3.4                 5
9           102424             3.9         3.9                 4
10           56793             4.0         3.6                 3

Checking for missing values

any(is.na(df))
[1] FALSE
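The check above returns a single TRUE or FALSE for the whole data frame. If a dataset did contain missing values, a per-column count would show where they are. A minimal sketch, assuming a small hypothetical data frame (sample_df) built from the first three rows shown above with one value deliberately blanked out:

```r
# First three rows of the dataset as shown above, with one value set to NA
# on purpose (hypothetical) so the per-column count has something to report.
sample_df <- data.frame(
  ACT.composite.score = c(22, 29, NA),
  SAT.total.score     = c(1625, 2090, 2188),
  high.school.gpa     = c(3.0, 4.0, 4.0)
)

# Number of missing values in each column
na_counts <- colSums(is.na(sample_df))
print(na_counts)
```

Applied to the real data, colSums(is.na(df)) would report the same information for all seven columns.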

Subset

# Group by the column itself (unquoted). Writing the name in quotes, as in
# group_by('parental.level.of.education'), groups by a constant string and
# collapses the data to a single row of overall means (approximately
# avg_income 67378, avg_ACT 28.6, avg_SAT 2000, avg_hs_GPA 3.71,
# avg_college_GPA 3.38).
df %>%
  group_by(parental.level.of.education) %>%
  summarise(
    avg_income = mean(parental.income),
    avg_ACT = mean(ACT.composite.score),
    avg_SAT = mean(SAT.total.score),
    avg_hs_GPA = mean(high.school.gpa),
    avg_college_GPA = mean(college.gpa)
  )
ggplot(df, aes(
  x = `parental.level.of.education`,
  y = `parental.income`
)) +
  geom_boxplot() +
  labs(
    title = "Parental Income by Parental Level of Education",
    x = "Parental Level of Education",
    y = "Parental Income"
  )

ggplot(df, aes(
  x = `parental.level.of.education`,
  y = `ACT.composite.score`
)) +
  geom_boxplot() +
  labs(
    title = "ACT Scores by Parental Level of Education",
    x = "Parental Level of Education",
    y = "ACT Score"
  ) 

ggplot(df, aes(
  x = `parental.level.of.education`,
  y = `SAT.total.score`
)) +
  geom_boxplot() +
  labs(
    title = "SAT Scores by Parental Level of Education",
    x = "Parental Level of Education",
    y = "SAT Score"
  )
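Because parental.level.of.education is a character column, the boxplots above arrange its categories alphabetically rather than by amount of schooling. One way to control the order is to convert the column to a factor with explicit levels. The sketch below is an assumption-laden illustration: the ordering in known_order is my own guess at a sensible ranking, it lists only the five levels visible in the head(df, 10) output, and the variable names are placeholders.

```r
# Education values copied from the head(df, 10) output in this document.
education <- c("high school", "associate's degree", "bachelor's degree",
               "associate's degree", "associate's degree", "master's degree",
               "some college", "some college", "bachelor's degree",
               "bachelor's degree")

# Assumed ordering from least to most schooling; adjust as needed.
known_order <- c("high school", "some college", "associate's degree",
                 "bachelor's degree", "master's degree")

# Append any level not listed above so no value is silently turned into NA.
all_levels <- c(known_order, setdiff(unique(education), known_order))
education_f <- factor(education, levels = all_levels)
print(levels(education_f))
```

Assigning a factor built this way back to df$parental.level.of.education before plotting would make ggplot draw the boxes in that order.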

Conclusion

The analysis shows that higher parental education is associated with higher mean SAT and ACT scores, which suggests a positive relationship between family educational background and standardized test performance. However, because the data is randomly generated, these findings should be read as an illustrative trend rather than as evidence about real students. If similar patterns appeared in real-world data, statistical analysis such as a correlation test or a regression model would be needed to formally assess the strength and significance of those relationships.
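The correlation test and regression model mentioned in the conclusion could look roughly like the sketch below. It is not part of the original analysis: it uses only the ten rows printed by head(df, 10), the names first_rows, ct, and fit are placeholders, and the choice of ACT and SAT as the pair of variables is an assumption made for illustration.

```r
# Ten rows copied from the head(df, 10) output in this document.
first_rows <- data.frame(
  ACT.composite.score = c(22, 29, 30, 33, 29, 28, 29, 30, 27, 32),
  SAT.total.score     = c(1625, 2090, 2188, 2151, 2050, 1976, 2097, 1976, 2072, 2246)
)

# Pearson correlation test: estimate, confidence interval, and p-value
ct <- cor.test(first_rows$ACT.composite.score, first_rows$SAT.total.score)
print(ct)

# Simple linear regression of SAT score on ACT score
fit <- lm(SAT.total.score ~ ACT.composite.score, data = first_rows)
print(summary(fit))
```

With real data, the same calls would run on the full data frame, and the regression formula could be extended with predictors such as parental.income or high.school.gpa.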