There is a old discussion from the country that I`m from that says that we should invest more in early colleges such as k-12 than in higher education. So I wanted to see how that influences in this data. So the main question is if k-12 attainments is mirrored also by bachelors and graduates. If yes I will try and see how much the investments given in this data-set to see what influences k-12 attainments which should also be mirrored in higher education.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(highcharter)
Warning: package 'highcharter' was built under R version 4.4.2
Registered S3 method overwritten by 'quantmod':
method from
as.zoo.data.frame zoo
Highcharts (www.highcharts.com) is a Highsoft software product which is
not free for commercial and Governmental use
setwd("C:/Users/Usuário/Documents/Data110")
df <-read.csv("Project3.csv")df <-head(df, -4) #Removing the last 4 lines since it is weridly formated on the original dataset
Since the Excel file had long and weird names for their variables we have to change for a better and faster use on the plots.
df <- df %>%rename( #Using dplyer to change the columns names for future use.psstr = Public.School.Student.Teacher.Ratio..K.12,psepp ="Public.School.Expenditures.Per.Pupil..K.12....Dollars.",aspst ="Average.Salary.for.Public.School.Teachers..K.12....Dollars.",hsa ="High.School.Attainment....",bda ="Bachelor.s.Degree.Attainment....",gda ="Graduate.Degree.Attainment....",acg ="Annual.College.Graduates",n2yc ="Number.of.2.Year.Colleges",n4ycu ="Number.of.4.Year.Colleges.and.Universities", ) #Becouse it has long and problematic names for variables renaming them made a huge diference.
Becouse of the way the data was presented we have to remake them as numerical.
df <- df %>%mutate(across(c(psstr,psepp,aspst,hsa,bda,gda,acg,n2yc,n4ycu), as.numeric)) #Changimg them back to numeric becouse of the problems in original dataset.
The main question I have is what factor lead to less students missing class. So we will do that buy creating a Boolean variable of high school attainment for if the state is higher than average or below average.
df <- df %>%mutate(average_value =mean(hsa), # Calculate averagehigher_than_avg = hsa > average_value # TRUE for values greater than average, FALSE otherwise ) #creating a new boolean variable for higher than average and lower than
model <-lm(gda ~ hsa, data = df)model1 <-lm(bda ~ hsa, data = df)summary(model)
Call:
lm(formula = gda ~ hsa, data = df)
Residuals:
Min 1Q Median 3Q Max
-5.774 -2.288 -1.201 1.017 21.201
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -21.1554 20.1875 -1.048 0.300
hsa 0.3741 0.2246 1.666 0.102
Residual standard error: 4.166 on 49 degrees of freedom
Multiple R-squared: 0.0536, Adjusted R-squared: 0.03428
F-statistic: 2.775 on 1 and 49 DF, p-value: 0.1021
summary(model1) #Creating the linear regression model for the graph and analysis
Call:
lm(formula = bda ~ hsa, data = df)
Residuals:
Min 1Q Median 3Q Max
-8.8136 -4.3154 -0.8099 2.8452 25.9944
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -57.5240 29.7863 -1.931 0.05925 .
hsa 0.9982 0.3314 3.012 0.00409 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.147 on 49 degrees of freedom
Multiple R-squared: 0.1562, Adjusted R-squared: 0.139
F-statistic: 9.074 on 1 and 49 DF, p-value: 0.004094
Analyzing for both methods especially the p value we can see that high school attainments are not a good meaningful predictor for graduates degree attainments as it is for bachelors. a little different from what I thought in the start. But for a R squared of only 0.139 means that this model even though it is significant there are other factors that influence this since it only explains 13.9%.
lm_model <-lm(gda ~ bda, data = df)df$predicted_gda <-predict(lm_model, newdata = df)highchart() %>%hc_chart(type ="scatter") %>%hc_title(text ="Correlation between Degree Attainments and High School Status") %>%#Creating titlehc_xAxis(title =list(text ="Bachelor's Degree Attainment")) %>%hc_yAxis(title =list(text ="Graduate Degree Attainment")) %>%#Creating a title for x and y axishc_plotOptions(scatter =list(marker =list(radius =4,symbol ="circle" ) #Making both options circles and changing theyr radius )) %>%hc_add_series(data = df, type ="scatter", hcaes(x = bda, y = gda, group = higher_than_avg)) %>%# Add linear regression linehc_add_series(data = df, type ="line", hcaes(x = bda, y = predicted_gda), name ="Regression Line", color ="blue", lineWidth =2) %>%hc_tooltip(pointFormat ="High School: {point.group}<br>State: {point.State}<br>Bachelor's Degree: {point.x}<br>Graduate Degree: {point.y}") %>%hc_legend(enabled =TRUE)
df <- df %>%mutate(Region =case_when( State %in%c("New York", "New Jersey", "Massachusetts", "Pennsylvania", "Connecticut", "Rhode Island", "Vermont", "New Hampshire", "Maine") ~"Northeast", State %in%c("Ohio", "Illinois", "Michigan", "Indiana", "Wisconsin", "Minnesota", "Iowa") ~"Midwest", State %in%c("Texas", "Florida", "Georgia", "North Carolina", "South Carolina", "Alabama", "Kentucky", "Tennessee", "Louisiana", "Mississippi", "Arkansas", "West Virginia") ~"South", State %in%c("California", "Washington", "Oregon", "Nevada", "Arizona", "Colorado", "Utah", "Idaho", "Hawaii", "Alaska") ~"West", State =="District of Columbia"~"Other", # Special case for DCTRUE~"Other"# Default case for any unlisted state ))#This chunk was made using chat gpt for speed purposes.
ggplot(df, aes(x = Region)) +geom_bar(aes(y = psepp, fill = Region), stat ="identity", position ="dodge", width =0.6) +scale_fill_manual(values =c("Northeast"="#FF6347", "South"="#4682B4", "Midwest"="#32CD32", "West"="#FFD700", "Other"="#8A2BE2")) +labs(title ="High School Investment and Attendance by Region", x ="Region", y ="High School Investment",fill ="Region",size ="High School Attendance",caption ="Data from Maryland Department of Commerce" ) +theme_minimal() +theme(plot.title =element_text(hjust =0.5, size =16, face ="bold"),axis.text.x =element_text(angle =45, hjust =1),plot.caption =element_text(size =10, face ="italic") ) +guides(fill =guide_legend(title ="Region"),color =guide_legend(title ="Region"),size =guide_legend(title ="High School Attendance") )
This graph shows the difference in investments to K-12 school across all regions from the United States. This graph is mostly useful for after to compare to the next graph. But its also useful to show the disparity in investments in regions since the Northeast shows almost double the investment in education on the sought
model2 <-lm(hsa ~ psepp, data = df)df$predicted_hsa <-predict(model2, newdata = df)#Linear regression for this graphsummary(model2)
Call:
lm(formula = hsa ~ psepp, data = df)
Residuals:
Min 1Q Median 3Q Max
-5.8407 -1.5144 -0.0315 2.0205 4.2714
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.671e+01 1.211e+00 71.599 < 2e-16 ***
psepp 2.564e-04 9.476e-05 2.706 0.00934 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.472 on 49 degrees of freedom
Multiple R-squared: 0.13, Adjusted R-squared: 0.1123
F-statistic: 7.323 on 1 and 49 DF, p-value: 0.009342
hchart(df, "scatter", hcaes(x = psepp, y = hsa, group = State)) %>%hc_title(text ="Amount comparing high school attendance and Investment") %>%hc_xAxis(title =list(text ="Amount spent per student")) %>%hc_yAxis(title =list(text ="High school attendance")) %>%hc_add_series(data = df, type ="line", hcaes(x = psepp, y = predicted_hsa), name ="Linear Regression", color ="red", lineWidth =2)
After analyzing we can reach a semi logical and predictable conclusion of better investment for early high school can help students thought the future such as in bachelors degree. Showing a upward trend on students attending class the more you invest helping also problems such as school evasion that also led to other problems.