Final Project

Author

Arthur Krause Nunes De Almeida

There is a old discussion from the country that I`m from that says that we should invest more in early colleges such as k-12 than in higher education. So I wanted to see how that influences in this data. So the main question is if k-12 attainments is mirrored also by bachelors and graduates. If yes I will try and see how much the investments given in this data-set to see what influences k-12 attainments which should also be mirrored in higher education.

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(highcharter)

Warning: package 'highcharter' was built under R version 4.4.2

Registered S3 method overwritten by 'quantmod':
  method            from
  as.zoo.data.frame zoo 
Highcharts (www.highcharts.com) is a Highsoft software product which is
not free for commercial and Governmental use

setwd("C:/Users/Usuário/Documents/Data110")

df <- read.csv("Project3.csv")
df <- head(df, -4) #Removing the last 4 lines since it is weridly formated on the original dataset

Since the Excel file had long and weird names for their variables we have to change for a better and faster use on the plots.

df <- df %>%
  rename( #Using dplyer to change the columns names for future use.
    psstr = Public.School.Student.Teacher.Ratio..K.12
,
    psepp = "Public.School.Expenditures.Per.Pupil..K.12....Dollars.",
    aspst = "Average.Salary.for.Public.School.Teachers..K.12....Dollars.",
       hsa = "High.School.Attainment....",
       bda = "Bachelor.s.Degree.Attainment....",
       gda = "Graduate.Degree.Attainment....",
    acg = "Annual.College.Graduates",
    n2yc = "Number.of.2.Year.Colleges",
    n4ycu = "Number.of.4.Year.Colleges.and.Universities",
    
  ) #Becouse it has long and problematic names for variables renaming them made a huge diference.

Becouse of the way the data was presented we have to remake them as numerical.

df <- df %>%
  mutate(across(c(psstr,psepp,aspst,hsa,bda,gda,acg,n2yc,n4ycu), as.numeric)) #Changimg them back to numeric becouse of the problems in original dataset.

The main question I have is what factor lead to less students missing class. So we will do that buy creating a Boolean variable of high school attainment for if the state is higher than average or below average.

df <- df %>%
  mutate(
    average_value = mean(hsa),  # Calculate average
    higher_than_avg = hsa > average_value  # TRUE for values greater than average, FALSE otherwise
  ) #creating a new boolean variable for higher than average and lower than

model <- lm(gda ~ hsa, data = df)
model1 <- lm(bda ~ hsa, data = df)
summary(model)


Call:
lm(formula = gda ~ hsa, data = df)

Residuals:
   Min     1Q Median     3Q    Max 
-5.774 -2.288 -1.201  1.017 21.201 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -21.1554    20.1875  -1.048    0.300
hsa           0.3741     0.2246   1.666    0.102

Residual standard error: 4.166 on 49 degrees of freedom
Multiple R-squared:  0.0536,    Adjusted R-squared:  0.03428 
F-statistic: 2.775 on 1 and 49 DF,  p-value: 0.1021

summary(model1)  #Creating the linear regression model for the graph and analysis


Call:
lm(formula = bda ~ hsa, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.8136 -4.3154 -0.8099  2.8452 25.9944 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept) -57.5240    29.7863  -1.931  0.05925 . 
hsa           0.9982     0.3314   3.012  0.00409 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.147 on 49 degrees of freedom
Multiple R-squared:  0.1562,    Adjusted R-squared:  0.139 
F-statistic: 9.074 on 1 and 49 DF,  p-value: 0.004094

Analyzing for both methods especially the p value we can see that high school attainments are not a good meaningful predictor for graduates degree attainments as it is for bachelors. a little different from what I thought in the start. But for a R squared of only 0.139 means that this model even though it is significant there are other factors that influence this since it only explains 13.9%.

lm_model <- lm(gda ~ bda, data = df)
df$predicted_gda <- predict(lm_model, newdata = df)

highchart() %>%
  hc_chart(type = "scatter") %>%
  hc_title(text = "Correlation between Degree Attainments and High School Status") %>% #Creating title
  hc_xAxis(title = list(text = "Bachelor's Degree Attainment")) %>%
  hc_yAxis(title = list(text = "Graduate Degree Attainment")) %>% #Creating a title for x and y axis
  hc_plotOptions(scatter = list(
    marker = list(
      radius = 4,
      symbol = "circle"
    ) #Making both options circles and changing theyr radius
  )) %>%
  hc_add_series(data = df, type = "scatter", hcaes(x = bda, y = gda, group = higher_than_avg)) %>%
  # Add linear regression line
  hc_add_series(data = df, type = "line", hcaes(x = bda, y = predicted_gda), name = "Regression Line", color = "blue", lineWidth = 2) %>%
  hc_tooltip(pointFormat = "High School: {point.group}<br>State: {point.State}<br>Bachelor's Degree: {point.x}<br>Graduate Degree: {point.y}") %>%
  hc_legend(enabled = TRUE)

df <- df %>%
  mutate(Region = case_when(
    State %in% c("New York", "New Jersey", "Massachusetts", "Pennsylvania", "Connecticut", 
                 "Rhode Island", "Vermont", "New Hampshire", "Maine") ~ "Northeast",
    State %in% c("Ohio", "Illinois", "Michigan", "Indiana", "Wisconsin", "Minnesota", "Iowa") ~ "Midwest",
    State %in% c("Texas", "Florida", "Georgia", "North Carolina", "South Carolina", "Alabama", 
                 "Kentucky", "Tennessee", "Louisiana", "Mississippi", "Arkansas", "West Virginia") ~ "South",
    State %in% c("California", "Washington", "Oregon", "Nevada", "Arizona", "Colorado", 
                 "Utah", "Idaho", "Hawaii", "Alaska") ~ "West",
    State == "District of Columbia" ~ "Other",  # Special case for DC
    TRUE ~ "Other"  # Default case for any unlisted state
  ))
#This chunk was made using chat gpt for speed purposes.

ggplot(df, aes(x = Region)) +
  geom_bar(aes(y = psepp, fill = Region), stat = "identity", position = "dodge", width = 0.6) + 
  scale_fill_manual(values = c("Northeast" = "#FF6347", "South" = "#4682B4", "Midwest" = "#32CD32", "West" = "#FFD700", "Other" = "#8A2BE2")) +
  labs(
    title = "High School Investment and Attendance by Region", 
    x = "Region", 
    y = "High School Investment",
    fill = "Region",
    size = "High School Attendance",
    caption = "Data from Maryland Department of Commerce"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
    axis.text.x = element_text(angle = 45, hjust = 1),
    plot.caption = element_text(size = 10, face = "italic")
  ) +
  guides(
    fill = guide_legend(title = "Region"),
    color = guide_legend(title = "Region"),
    size = guide_legend(title = "High School Attendance")
  )

This graph shows the difference in investments to K-12 school across all regions from the United States. This graph is mostly useful for after to compare to the next graph. But its also useful to show the disparity in investments in regions since the Northeast shows almost double the investment in education on the sought

model2 <- lm(hsa ~ psepp, data = df)
df$predicted_hsa <- predict(model2, newdata = df)
#Linear regression for this graph
summary(model2)


Call:
lm(formula = hsa ~ psepp, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.8407 -1.5144 -0.0315  2.0205  4.2714 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 8.671e+01  1.211e+00  71.599  < 2e-16 ***
psepp       2.564e-04  9.476e-05   2.706  0.00934 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.472 on 49 degrees of freedom
Multiple R-squared:   0.13, Adjusted R-squared:  0.1123 
F-statistic: 7.323 on 1 and 49 DF,  p-value: 0.009342

hchart(df, "scatter", hcaes(x = psepp, y = hsa, group = State)) %>%
  hc_title(text = "Amount comparing high school attendance and Investment") %>%
  hc_xAxis(title = list(text = "Amount spent per student")) %>%
  hc_yAxis(title = list(text = "High school attendance")) %>%
  hc_add_series(data = df, type = "line", hcaes(x = psepp, y = predicted_hsa), 
                name = "Linear Regression", color = "red", lineWidth = 2)

After analyzing we can reach a semi logical and predictable conclusion of better investment for early high school can help students thought the future such as in bachelors degree. Showing a upward trend on students attending class the more you invest helping also problems such as school evasion that also led to other problems.