Data 605 Discussion Post 11
Load the education dataset
library(ggplot2)
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 3.5.3
## -- Attaching packages ------------------------------------------------------------------------------------ tidyverse 1.2.1 --
## v tibble 2.0.1 v purrr 0.2.5
## v tidyr 0.8.2 v dplyr 0.7.8
## v readr 1.3.1 v stringr 1.3.1
## v tibble 2.0.1 v forcats 0.3.0
## -- Conflicts --------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
setwd("C:/Users/arnou/Documents/Education Data")
df <- read.csv(file="states_education_data.csv", header=TRUE, sep=",")
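# Spending per child; the factor of 1000 assumes TOTAL_EXPENDITURE is reported in thousands of dollars
df$Expenditure_per_child <- df$TOTAL_EXPENDITURE * 1000 / df$GRADES_ALL_G
# Mean of the four average scores (grade 4 and grade 8, math and reading)
df$Total_Scores <- (df$AVG_MATH_4_SCORE + df$AVG_MATH_8_SCORE + df$AVG_READING_4_SCORE + df$AVG_READING_8_SCORE) / 4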
ggplot(data = df, aes(x = Expenditure_per_child, y = Total_Scores)) +
  geom_point(color = "blue") +
  geom_smooth(method = "lm", se = FALSE) +
  xlim(0, 25000)
## Warning: Removed 1073 rows containing non-finite values (stat_smooth).
## Warning: Removed 1073 rows containing missing values (geom_point).
linear_model <- lm(Total_Scores~Expenditure_per_child,df)
plot(linear_model)
linear_model
##
## Call:
## lm(formula = Total_Scores ~ Expenditure_per_child, data = df)
##
## Coefficients:
## (Intercept) Expenditure_per_child
## 2.408e+02 7.304e-04
summary(linear_model)
##
## Call:
## lm(formula = Total_Scores ~ Expenditure_per_child, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -33.681 -2.878 0.904 4.065 11.230
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.408e+02 9.849e-01 244.505 <2e-16 ***
## Expenditure_per_child 7.304e-04 7.822e-05 9.337 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.289 on 418 degrees of freedom
## (1072 observations deleted due to missingness)
## Multiple R-squared: 0.1726, Adjusted R-squared: 0.1706
## F-statistic: 87.19 on 1 and 418 DF, p-value: < 2.2e-16
Conclusion
For each additional thousand dollars a state spends per child, the average test score is about 0.73 points higher (the slope is 7.304e-04 points per dollar). The relationship is statistically significant (p < 2e-16). The R-squared value is 0.17, which means that per-child expenditure explains about 17% of the variation in test scores.
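The per-thousand-dollar figure comes from rescaling the fitted slope. A minimal sketch of that arithmetic follows, with the coefficients copied by hand from the summary output above. The $10,000 spending level is a hypothetical value chosen only for illustration.

```r
# Coefficients copied from the summary(linear_model) output above
intercept <- 240.8              # 2.408e+02
slope_per_dollar <- 7.304e-04   # score points per dollar spent per child

# Rescale the slope to points per additional $1,000 per child
slope_per_thousand <- slope_per_dollar * 1000
print(slope_per_thousand)

# Predicted average score at a hypothetical spending level of $10,000 per child
predicted_at_10k <- intercept + slope_per_dollar * 10000
print(predicted_at_10k)
```

With the model object still in the session, `coef(linear_model)` returns the same two coefficients without copying them by hand.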
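From the Q-Q plot we can see that the residuals are approximately normally distributed except at the extreme ends. The residual summary (minimum -33.7, maximum 11.2) suggests the departure is larger in the lower tail.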
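One way to look at residual normality directly is to draw only the Q-Q panel and pair it with a formal test. The sketch below uses simulated stand-in data, not the education dataset, so it runs on its own. The names `x`, `y`, and `fit` and the simulation parameters are assumptions made for illustration.

```r
set.seed(1)

# Simulated stand-in data (NOT the education dataset), roughly on the same scale
x <- runif(200, min = 5000, max = 20000)
y <- 240 + 7e-04 * x + rnorm(200, sd = 6)
fit <- lm(y ~ x)

# Draw only the normal Q-Q panel of the standard lm diagnostics
plot(fit, which = 2)

# Shapiro-Wilk test on the residuals; a small p-value indicates non-normality
res <- residuals(fit)
sw <- shapiro.test(res)
print(sw$p.value)
```

On the real model, replacing `fit` with `linear_model` would give the same two diagnostics.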